Abstract


Several authors have recently developed risk-sensitive policy gradient methods 
that augment the standard expected cost minimization problem with a measure of 
variability in cost. These studies have focused on specific risk-measures, such as 
the variance or conditional value at risk (CVaR). In this work, we extend the policy
gradient method to the whole class of coherent risk measures, which is widely
accepted in finance and operations research, among other fields. We consider
both static and time-consistent dynamic risk measures. For static risk measures,
our approach is in the spirit of policy gradient algorithms and combines a standard
sampling approach with convex programming. For dynamic risk measures, our approach
is actor-critic style and involves explicit approximation of the value function.
Most importantly, our contribution presents a unified approach to risk-sensitive 
reinforcement learning that generalizes and extends previous results. 


1 Introduction 

Risk-sensitive optimization considers problems in which the objective involves a risk measure of 
the random cost, in contrast to the typical expected cost objective. Such problems are important 
when the decision-maker wishes to manage the variability of the cost, in addition to its expected 
outcome, and are standard in various applications of finance and operations research. In reinforcement
learning (RL) [?], risk-sensitive objectives have gained popularity as a means to regularize
the variability of the total (discounted) cost/reward in a Markov decision process (MDP). 

Many risk objectives have been investigated in the literature and applied to RL, such as the celebrated
Markowitz mean-variance model [?], Value-at-Risk (VaR) and Conditional Value at Risk
(CVaR) [22], [?], [26], [?], [?], [36]. The view taken in this paper is that the preference of one risk
measure over another is problem-dependent and depends on factors such as the cost distribution,
sensitivity to rare events, ease of estimation from data, and computational tractability of the
optimization problem. However, the highly influential paper of Artzner et al. [?] identified a set of
natural properties that are desirable for a risk measure to satisfy. Risk measures that satisfy these 
properties are termed coherent and have obtained widespread acceptance in financial applications, 
among others. We focus on such coherent measures of risk in this work. 

For sequential decision problems, such as MDPs, another desirable property of a risk measure is
time consistency. A time-consistent risk measure satisfies a “dynamic programming” style property:
if a strategy is risk-optimal for an n-stage problem, then the component of the policy from the t-th
time until the end (where t < n) is also risk-optimal (see principle of optimality in [?]). The recently
proposed class of dynamic Markov coherent risk measures [?] satisfies both the coherence and time
consistency properties. 

In this work, we present policy gradient algorithms for RL with a coherent risk objective. Our 
approach applies to the whole class of coherent risk measures, thereby generalizing and unifying 
previous approaches that have focused on individual risk measures. We consider both static coherent 
risk of the total discounted return from an MDP and time-consistent dynamic Markov coherent 
risk. Our main contribution is formulating the risk-sensitive policy-gradient under the coherent-risk 
framework. More specifically, we provide: 

• A new formula for the gradient of static coherent risk that is convenient for approximation 
using sampling. 

• An algorithm for the gradient of general static coherent risk that involves sampling with 
convex programming and a corresponding consistency result. 

• A new policy gradient theorem for Markov coherent risk, relating the gradient to a suitable 
value function and a corresponding actor-critic algorithm. 

Several previous results are special cases of the results presented here; our approach allows us to
rederive them in greater generality and simplicity.

Related Work Risk-sensitive optimization in RL for specific risk functions has been studied
recently by several authors. [?] studied exponential utility functions, [22], [?], [26] studied
mean-variance models, [?], [36] studied CVaR in the static setting, and [25], [?] studied dynamic
coherent risk for systems with linear dynamics. Our paper presents a general method for the whole class
of coherent risk measures (both static and dynamic) and is not limited to a specific choice within
that class, nor to particular system dynamics.

Reference [24] showed that an MDP with a dynamic coherent risk objective is essentially a
robust MDP. The planning for large scale MDPs was considered in [37], using an approximation of
the value function. For many problems, approximation in the policy space is more suitable (see,
e.g., [?]). Our sampling-based RL-style approach is suitable for approximations both in the policy
and value function, and scales up to large or continuous MDPs. We do, however, make use of a
technique of [37] in a part of our method.

Optimization of coherent risk measures was thoroughly investigated by Ruszczynski and
Shapiro [31] (see also [?]) for the stochastic programming case in which the policy parameters
do not affect the distribution of the stochastic system (i.e., the MDP trajectory), but only the reward
function, and thus, this approach is not suitable for most RL problems. For the case of MDPs and
dynamic risk, [?] proposed a dynamic programming approach. This approach does not scale up
to large MDPs, due to the “curse of dimensionality”. For further motivation of risk-sensitive policy
gradient methods, we refer the reader to [22], [?], [?], [?].

2 Preliminaries 

Consider a probability space $(\Omega, \mathcal{F}, P_\theta)$, where $\Omega$ is the set of outcomes (sample space), $\mathcal{F}$ is a
$\sigma$-algebra over $\Omega$ representing the set of events we are interested in, and $P_\theta \in \mathcal{B}$, where $\mathcal{B}$,
the set of probability distributions, is a probability measure over $\mathcal{F}$
parameterized by some tunable parameter $\theta$. In the following, we suppress the notation of $\theta$
in $\theta$-dependent quantities.

To ease the technical exposition, in this paper we restrict our attention to finite probability spaces, 
i.e., $\Omega$ has a finite number of elements. Our results can be extended to the $L_p$-normed spaces without
loss of generality, but the details are omitted for brevity. 

Denote by $\mathcal{Z}$ the space of random variables $Z : \Omega \to (-\infty, \infty)$ defined over the probability space
$(\Omega, \mathcal{F}, P_\theta)$. In this paper, a random variable $Z \in \mathcal{Z}$ is interpreted as a cost, i.e., the smaller the
realization of $Z$, the better. For $Z, W \in \mathcal{Z}$, we denote by $Z \le W$ the point-wise partial order,
i.e., $Z(\omega) \le W(\omega)$ for all $\omega \in \Omega$. We denote by $\mathbb{E}_\xi[Z] = \sum_{\omega \in \Omega} P_\theta(\omega)\xi(\omega)Z(\omega)$ a $\xi$-weighted
expectation of $Z$.

An MDP is a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, C, P, \gamma, x_0)$, where $\mathcal{X}$ and $\mathcal{A}$ are the state and action spaces;
$C(x) \in [-C_{\max}, C_{\max}]$ is a bounded, deterministic, and state-dependent cost; $P(\cdot|x, a)$ is the
transition probability distribution; $\gamma$ is a discount factor; and $x_0$ is the initial state.^1 Actions are chosen
according to a $\theta$-parameterized stationary Markov^2 policy $\mu_\theta(\cdot|x)$. We denote by $x_0, a_0, \ldots, x_T, a_T$
a trajectory of length $T$ drawn by following the policy $\mu_\theta$ in the MDP.

2.1 Coherent Risk Measures 

A risk measure is a function $\rho : \mathcal{Z} \to \mathbb{R}$ that maps an uncertain outcome $Z$ to the extended real line
$\mathbb{R} \cup \{+\infty, -\infty\}$, e.g., the expectation $\mathbb{E}[Z]$ or the conditional value-at-risk (CVaR)
$\min_{\nu \in \mathbb{R}} \{\nu + \frac{1}{\alpha}\mathbb{E}[(Z - \nu)^+]\}$. A risk measure is called coherent if it satisfies the following
conditions for all $Z, W \in \mathcal{Z}$ [?]:

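A1 Convexity: $\forall \lambda \in [0,1]$, $\rho(\lambda Z + (1 - \lambda)W) \le \lambda\rho(Z) + (1 - \lambda)\rho(W)$;

A2 Monotonicity: if $Z \le W$, then $\rho(Z) \le \rho(W)$;

A3 Translation invariance: $\forall a \in \mathbb{R}$, $\rho(Z + a) = \rho(Z) + a$;

A4 Positive homogeneity: if $\lambda \ge 0$, then $\rho(\lambda Z) = \lambda\rho(Z)$.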

Intuitively, these conditions ensure the “rationality” of single-period risk assessments: A1 ensures
that diversifying an investment will reduce its risk; A2 guarantees that an asset with a higher cost
for every possible scenario is indeed riskier; A3, also known as ‘cash invariance’, means that the
deterministic part of an investment portfolio does not contribute to its risk; the intuition behind A4
is that doubling a position in an asset doubles its risk. We further refer the reader to [?] for a more
detailed motivation of coherent risk. 

The following representation theorem [?] shows an important property of coherent risk measures
that is fundamental to our gradient-based approach.

Theorem 2.1. A risk measure $\rho : \mathcal{Z} \to \mathbb{R}$ is coherent if and only if there exists a convex bounded
and closed set $\mathcal{U} \subset \mathcal{B}$ such that^3

$$\rho(Z) = \max_{\xi \,:\, \xi P_\theta \in \mathcal{U}(P_\theta)} \mathbb{E}_\xi[Z]. \quad (1)$$

The result essentially states that any coherent risk measure is an expectation w.r.t. a worst-case
density function $\xi P_\theta$, chosen adversarially from a suitable set of test density functions $\mathcal{U}(P_\theta)$, referred
to as the risk envelope. Moreover, it means that any coherent risk measure is uniquely represented by its
risk envelope. Thus, in the sequel, we shall interchangeably refer to coherent risk measures either
by their explicit functional representation, or by their corresponding risk envelope.
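
To make the representation in Theorem 2.1 concrete, the following minimal Python sketch evaluates CVaR on a small finite probability space in two ways: through its variational definition, and through the maximization over a risk envelope as in Eq. (1). The envelope used here, $\{\xi : 0 \le \xi(\omega) \le 1/\alpha,\ \sum_\omega \xi(\omega)P(\omega) = 1\}$, is the standard one for CVaR; the function names and the toy numbers are illustrative assumptions rather than part of the paper.

```python
def cvar_variational(p, z, alpha):
    """CVaR as min over nu of nu + E[(Z - nu)^+] / alpha.

    The objective is piecewise linear and convex in nu, so on a finite
    space the minimum is attained at one of the atoms of Z.
    """
    def objective(nu):
        return nu + sum(pi * max(zi - nu, 0.0) for pi, zi in zip(p, z)) / alpha
    return min(objective(nu) for nu in z)


def cvar_envelope(p, z, alpha):
    """CVaR as the max of the xi-weighted expectation over the envelope.

    The maximizer puts the largest allowed density xi = 1/alpha on the
    highest costs until the reweighted probabilities sum to one.
    """
    rho, mass = 0.0, 0.0
    for i in sorted(range(len(z)), key=lambda k: -z[k]):
        w = min(p[i] / alpha, 1.0 - mass)  # w = xi(omega) * P(omega)
        if w > 0.0:
            rho += w * z[i]
            mass += w
    return rho


p = [0.25, 0.25, 0.25, 0.25]  # hypothetical distribution P
z = [1.0, 2.0, 3.0, 10.0]     # hypothetical costs Z(omega)
print(cvar_variational(p, z, 0.5), cvar_envelope(p, z, 0.5))
```

Both functions return the same value on this example, and setting `alpha = 1.0` collapses the envelope to $\xi \equiv 1$, which recovers the plain expectation.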

In this paper, we assume that the risk envelope $\mathcal{U}(P_\theta)$ is given in a canonical convex programming
formulation, and satisfies the following conditions. 

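Assumption 2.2 (The General Form of Risk Envelope). For each given policy parameter $\theta$,
the risk envelope $\mathcal{U}$ of a coherent risk measure can be written as

$$\mathcal{U}(P_\theta) = \Big\{ \xi P_\theta : \ g_e(\xi, P_\theta) = 0,\ \forall e \in \mathcal{E}, \quad f_i(\xi, P_\theta) \le 0,\ \forall i \in \mathcal{I}, \quad \sum_{\omega \in \Omega} \xi(\omega)P_\theta(\omega) = 1, \quad \xi(\omega) \ge 0 \Big\}, \quad (2)$$

where each constraint $g_e(\xi, P_\theta)$ is an affine function in $\xi$, each constraint $f_i(\xi, P_\theta)$ is a convex
function in $\xi$, and there exists a strictly feasible point $\bar{\xi}$. $\mathcal{E}$ and $\mathcal{I}$ here denote the sets of equality
and inequality constraints, respectively. Furthermore, for any given $\xi \in \mathcal{B}$, $f_i(\xi, p)$ and $g_e(\xi, p)$ are
twice differentiable in $p$, and there exists an $M > 0$ such that

$$\max\Big\{ \max_{i \in \mathcal{I}} \Big| \frac{d f_i(\xi, p)}{d p(\omega)} \Big|,\ \max_{e \in \mathcal{E}} \Big| \frac{d g_e(\xi, p)}{d p(\omega)} \Big| \Big\} \le M, \quad \forall \omega \in \Omega.$$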


^1 Our results may easily be extended to random costs, state-action dependent costs, and random initial states.

^2 For the dynamic Markov risk we study, an optimal policy is stationary Markov, while this is not necessarily
the case for the static risk. Our results can be extended to history-dependent policies or stationary Markov
policies on a state space augmented with the accumulated cost. The latter has been shown to be sufficient for
optimizing the CVaR risk [?].

^3 When we study risk in MDPs, the risk envelope $\mathcal{U}(P_\theta)$ in Eq. (2) also depends on the state $x$.
Assumption 2.2 implies that the risk envelope $\mathcal{U}(P_\theta)$ is known in an explicit form. From Theorem
6.6 of [32], in the case of a finite probability space, $\rho$ is a coherent risk if and only if $\mathcal{U}(P_\theta)$ is a
convex and compact set. This justifies the affine assumption of $g_e$ and the convex assumption of $f_i$.
Moreover, the additional assumption on the smoothness of the constraints holds for many popular
coherent risk measures, such as the CVaR, the mean-semi-deviation, and spectral risk measures [?].

2.2 Dynamic Risk Measures 

The risk measures defined above do not take into account any temporal structure that the random 
variable might have, such as when it is associated with the return of a trajectory in the case of 
MDPs. In this sense, such risk measures are called static. Dynamic risk measures, on the other hand, 
explicitly take into account the temporal nature of the stochastic outcome. A primary motivation for 
considering such measures is the issue of time consistency, usually defined as follows [?]: if a
certain outcome is considered less risky in all states of the world at stage t + 1, then it should also
be considered less risky at stage t. Example 2.1 in [?] shows the importance of time consistency
in the evaluation of risk in a dynamic setting. It illustrates that for multi-period decision-making,
optimizing a static measure can lead to “time-inconsistent” behavior. Similar paradoxical results
could be obtained with other risk metrics; we refer the readers to [?] and [?] for further insights.

Markov Coherent Risk Measures. Markov risk measures were introduced in [?] and are a useful
class of dynamic time-consistent risk measures that are particularly important for our study of risk
in MDPs. For a $T$-length horizon and MDP $\mathcal{M}$, the Markov coherent risk measure $\rho_T(\mathcal{M})$ is

$$\rho_T(\mathcal{M}) = C(x_0) + \gamma\rho\Big(C(x_1) + \ldots + \gamma\rho\big(C(x_{T-1}) + \gamma\rho(C(x_T))\big)\Big), \quad (3)$$

where $\rho$ is a static coherent risk measure that satisfies Assumption 2.2 and $x_0, \ldots, x_T$ is a trajectory
drawn from the MDP $\mathcal{M}$ under policy $\mu_\theta$. It is important to note that in (3), each static coherent risk
$\rho$ at state $x \in \mathcal{X}$ is induced by the transition probability $P_\theta(\cdot|x) = \sum_{a \in \mathcal{A}} P(x'|x, a)\mu_\theta(a|x)$. We
also define $\rho_\infty(\mathcal{M}) = \lim_{T \to \infty} \rho_T(\mathcal{M})$, which is well-defined since $\gamma < 1$ and the cost is bounded.

We further assume that $\rho$ in (3) is a Markov risk measure, i.e., the evaluation of each static coherent
risk measure $\rho$ is not allowed to depend on the whole past.
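
The nested structure of Eq. (3) can be evaluated by a backward recursion, because translation invariance (A3) lets the deterministic cost $C(x_t)$ be pulled out of each inner risk evaluation. The sketch below is a hedged illustration of that recursion for a small finite chain, taking CVaR as the static risk $\rho$; the names `cvar` and `markov_coherent_risk`, the argument layout, and the two-state example are assumptions made for illustration only.

```python
def cvar(p, z, alpha):
    """CVaR of a finite cost distribution, computed over its risk envelope."""
    rho, mass = 0.0, 0.0
    for i in sorted(range(len(z)), key=lambda k: -z[k]):
        w = min(p[i] / alpha, 1.0 - mass)  # reweighted probability xi * p
        if w > 0.0:
            rho += w * z[i]
            mass += w
    return rho


def markov_coherent_risk(C, P, gamma, alpha, T, x0):
    """Evaluate rho_T(M) of Eq. (3) by backward recursion.

    C[x]    : deterministic cost of state x
    P[x][y] : policy-induced transition probability P_theta(y | x)
    """
    V = list(C)  # innermost term: the risk of C(x_T) seen from x_T itself
    for _ in range(T):
        # One more layer of nesting: C(x) + gamma * rho(V(x')) under P(. | x)
        V = [C[x] + gamma * cvar(P[x], V, alpha) for x in range(len(C))]
    return V[x0]


C = [0.0, 1.0]                # hypothetical costs
P = [[0.5, 0.5], [0.0, 1.0]]  # hypothetical transitions; state 1 is absorbing
print(markov_coherent_risk(C, P, gamma=0.5, alpha=1.0, T=2, x0=0))  # risk-neutral
print(markov_coherent_risk(C, P, gamma=0.5, alpha=0.5, T=2, x0=0))  # risk-averse
```

With `alpha = 1.0` the recursion reduces to the usual expected discounted cost, while a smaller `alpha` weights the costly transition more heavily at every stage.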

3 Problem Formulation 

In this paper, we are interested in solving two risk-sensitive optimization problems. Given a random
variable $Z$ and a static coherent risk measure $\rho$ as defined in Section 2, the static risk problem (SRP)
is given by

$$\min_\theta \rho(Z). \quad (4)$$

For example, in an RL setting, $Z$ may correspond to the cumulative discounted cost $Z = C(x_0) +
\gamma C(x_1) + \cdots + \gamma^T C(x_T)$ of a trajectory induced by an MDP with a policy parameterized by $\theta$.

For an MDP $\mathcal{M}$ and a dynamic Markov coherent risk measure $\rho_T$ as defined by Eq. (3), the dynamic
risk problem (DRP) is given by

$$\min_\theta \rho_\infty(\mathcal{M}). \quad (5)$$

Except for very limited cases, there is no reason to hope that either the SRP in (4) or the DRP
in (5) is a tractable problem, since the dependence of the risk measure on $\theta$ may be complex
and non-convex. In this work, we aim towards a more modest goal and search for a locally optimal
$\theta$. Thus, the main problem that we are trying to solve in this paper is how to calculate the gradients
of the SRP's and DRP's objective functions

$$\nabla_\theta \rho(Z) \quad \text{and} \quad \nabla_\theta \rho_\infty(\mathcal{M}).$$

We are interested in non-trivial cases in which the gradients cannot be calculated analytically. In 
the static case, this would correspond to a non-trivial dependence of Z on 9. For dynamic risk, we 
also consider cases where the state space is too large for a tractable computation. Our approach for 
dealing with such difficult cases is through sampling. We assume that in the static case, we may 
obtain i.i.d. samples of the random variable Z. For the dynamic case, we assume that for each state 
and action $(x, a)$ of the MDP, we may obtain i.i.d. samples of the next state $x' \sim P(\cdot|x, a)$. We
show that sampling may indeed be used in both cases to devise suitable estimators for the gradients. 
To finally solve the SRP and DRP problems, a gradient estimate may be plugged into a standard
stochastic gradient descent (SGD) algorithm for learning a locally optimal solution to (4) and (5).
From the structure of the dynamic risk in Eq. (3), one may think that a gradient estimator for $\rho(Z)$ may
help us to estimate the gradient $\nabla_\theta \rho_\infty(\mathcal{M})$. Indeed, we follow this idea and begin with estimating
the gradient in the static risk case.


4 Gradient Formula for Static Risk 

In this section, we consider a static coherent risk measure $\rho(Z)$ and propose sampling-based
estimators for $\nabla_\theta \rho(Z)$. We make the following assumption on the policy parametrization, which is
standard in the policy gradient literature [?].

Assumption 4.1. The likelihood ratio $\nabla_\theta \log P(\omega)$ is well-defined and bounded for all $\omega \in \Omega$.

Moreover, our approach implicitly assumes that given some $\omega \in \Omega$, $\nabla_\theta \log P(\omega)$ may be easily
calculated. This is also a standard requirement for policy gradient algorithms [?] and is satisfied
in various applications such as queueing systems, inventory management, and financial engineering
(see, e.g., the survey by Fu [?]).


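Using Theorem 2.1 and Assumption 2.2, for each $\theta$, we have that $\rho(Z)$ is the solution to the
convex optimization problem (1) (for that value of $\theta$). The Lagrangian function of (1), denoted by
$L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$, may be written as

$$L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I) = \sum_{\omega \in \Omega} \xi(\omega)P_\theta(\omega)Z(\omega) - \lambda^P\Big(\sum_{\omega \in \Omega} \xi(\omega)P_\theta(\omega) - 1\Big) - \sum_{e \in \mathcal{E}} \lambda^E(e)\, g_e(\xi, P_\theta) - \sum_{i \in \mathcal{I}} \lambda^I(i)\, f_i(\xi, P_\theta). \quad (6)$$

The convexity of (1) and its strict feasibility due to Assumption 2.2 imply that $L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$
has a non-empty set of saddle points $\mathcal{S}$. The next theorem presents a formula for the gradient
$\nabla_\theta \rho(Z)$. As we shall subsequently show, this formula is particularly convenient for devising
sampling-based estimators for $\nabla_\theta \rho(Z)$.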


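Theorem 4.2. Let Assumptions 2.2 and 4.1 hold. For any saddle point $(\xi^*_\theta, \lambda^{*,P}_\theta, \lambda^{*,E}_\theta, \lambda^{*,I}_\theta) \in \mathcal{S}$
of (6), we have

$$\nabla_\theta \rho(Z) = \mathbb{E}_{\xi^*_\theta}\Big[\nabla_\theta \log P(\omega)\big(Z - \lambda^{*,P}_\theta\big)\Big] - \sum_{e \in \mathcal{E}} \lambda^{*,E}_\theta(e)\, \nabla_\theta g_e(\xi^*_\theta; P_\theta) - \sum_{i \in \mathcal{I}} \lambda^{*,I}_\theta(i)\, \nabla_\theta f_i(\xi^*_\theta; P_\theta).$$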




The proof of this theorem, given in the supplementary material, involves an application of the
Envelope theorem [21] and a standard ‘likelihood-ratio’ trick. We now demonstrate the utility of Theorem
4.2 with several examples in which we show that it generalizes previously known results, and also
enables deriving new useful gradient formulas.


4.1 Example 1: CVaR 

The CVaR at level $\alpha \in [0,1]$ of a random variable $Z$, denoted by $\rho_{\text{CVaR}}(Z; \alpha)$, is a very popular
coherent risk measure [?], defined as

$$\rho_{\text{CVaR}}(Z; \alpha) = \inf_{t \in \mathbb{R}} \big\{ t + \alpha^{-1}\mathbb{E}[(Z - t)^+] \big\}.$$

When $Z$ is continuous, $\rho_{\text{CVaR}}(Z; \alpha)$ is well-known to be the mean of the $\alpha$-tail distribution of $Z$,
$\mathbb{E}[Z \mid Z > q_\alpha]$, where $q_\alpha$ is a $(1 - \alpha)$-quantile of $Z$. Thus, selecting a small $\alpha$ makes CVaR
particularly sensitive to rare, but very high costs.

The risk envelope for CVaR is known to be [?] $\mathcal{U} = \{\xi P_\theta : \xi(\omega) \in [0, \alpha^{-1}],\ \sum_{\omega \in \Omega} \xi(\omega)P_\theta(\omega) = 1\}$.
Furthermore, [?] show that the saddle points of (6) satisfy
$\xi^*_\theta(\omega) = \alpha^{-1}$ when $Z(\omega) > \lambda^{*,P}_\theta$, and $\xi^*_\theta(\omega) = 0$ when $Z(\omega) < \lambda^{*,P}_\theta$, where $\lambda^{*,P}_\theta$ is any
$(1 - \alpha)$-quantile of $Z$. Plugging this result into Theorem 4.2, we can easily show that

$$\nabla_\theta \rho_{\text{CVaR}}(Z; \alpha) = \mathbb{E}\big[\nabla_\theta \log P(\omega)(Z - q_\alpha) \mid Z(\omega) > q_\alpha\big].$$

This formula was recently proved in [?] for the case of continuous distributions by an explicit
calculation of the conditional expectation, and under several additional smoothness assumptions.
Here we show that it holds regardless of these assumptions and in the discrete case as well. Our
proof is also considerably simpler.
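
As a sanity check of the CVaR formula, the following Python sketch evaluates the expression of Theorem 4.2 exactly on a small finite space and can be compared against a finite-difference derivative of the CVaR itself. It assumes, purely for illustration, that $P_\theta$ is a softmax distribution over four outcomes with fixed costs; `softmax`, `cvar_saddle`, `cvar_gradient`, and the numbers are hypothetical names and values, not the paper's.

```python
import math


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [v / s for v in e]


def cvar_saddle(p, z, alpha):
    """Return (rho, xi, lam): the CVaR value, a maximizing density xi, and the
    multiplier lam, taken as the cost of the last outcome that receives weight."""
    xi = [0.0] * len(z)
    rho, mass, lam = 0.0, 0.0, min(z)
    for i in sorted(range(len(z)), key=lambda k: -z[k]):
        w = min(p[i] / alpha, 1.0 - mass)  # w = xi(omega) * P(omega)
        if w > 1e-12:                      # ignore floating-point residue
            xi[i] = w / p[i]
            rho += w * z[i]
            mass += w
            lam = z[i]
    return rho, xi, lam


def cvar_gradient(theta, z, alpha):
    """Sum over omega of P * xi * grad log P * (Z - lam), as in Theorem 4.2.
    The CVaR envelope constraints do not depend on P, so no other terms appear."""
    p = softmax(theta)
    _, xi, lam = cvar_saddle(p, z, alpha)
    grad = []
    for k in range(len(theta)):
        # For a softmax, d/dtheta_k log P(omega) = 1{omega == k} - P(k)
        grad.append(sum(p[w] * xi[w] * ((1.0 if w == k else 0.0) - p[k]) * (z[w] - lam)
                        for w in range(len(z))))
    return grad


theta = [0.1, -0.3, 0.5, 0.2]  # hypothetical policy parameters
z = [1.0, 4.0, 2.0, 8.0]       # hypothetical costs
print(cvar_gradient(theta, z, alpha=0.3))
```

Replacing the exact sums over $\omega$ with averages over sampled outcomes turns this into a sampling-based estimator. The formula identifies a unique gradient only where the saddle point is unique, which is the case in this example.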
4.2 Example 2: Mean-Semideviation

The semi-deviation of a random variable $Z$ is defined as $\text{SD}[Z] = \big(\mathbb{E}[(Z - \mathbb{E}[Z])_+^2]\big)^{1/2}$. The
semi-deviation captures the variation of the cost only above its mean, and is an appealing alternative
to the standard deviation, which does not distinguish between the variability of upside and downside
deviations. For some $\alpha \in [0,1]$, the mean-semideviation risk measure is defined as $\rho_{\text{MSD}}(Z; \alpha) =
\mathbb{E}[Z] + \alpha\,\text{SD}[Z]$, and is a coherent risk measure [?]. We have the following result:

Proposition 4.3. Under Assumption 4.1, with $\nabla_\theta \mathbb{E}[Z] = \mathbb{E}[\nabla_\theta \log P(\omega)Z]$, we have

$$\nabla_\theta \rho_{\text{MSD}}(Z; \alpha) = \nabla_\theta \mathbb{E}[Z] + \frac{\alpha\, \mathbb{E}\big[(Z - \mathbb{E}[Z])_+\big(\nabla_\theta \log P(\omega)(Z - \mathbb{E}[Z]) - \nabla_\theta \mathbb{E}[Z]\big)\big]}{\text{SD}(Z)}.$$

This proposition can be used to devise a sampling-based estimator for $\nabla_\theta \rho_{\text{MSD}}(Z; \alpha)$ by replacing
all the expectations with sample averages. The algorithm along with the proof of the proposition are
in the supplementary material. In Section 6 we provide a numerical illustration of optimization with
a mean-semideviation objective.

4.3 General Gradient Estimation Algorithm 

In the two previous examples, we obtained a gradient formula by analytically calculating the
Lagrangian saddle point (6) and plugging it into the formula of Theorem 4.2. We now consider a
general coherent risk $\rho(Z)$ for which, in contrast to the CVaR and mean-semideviation cases, the
Lagrangian saddle point is not known analytically. We only assume that we know the structure of the
risk envelope as given by (2). We show that in this case, $\nabla_\theta \rho(Z)$ may be estimated using a sample
average approximation (SAA) of the formula in Theorem 4.2.

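Assume that we are given $N$ i.i.d. samples $\omega_i \sim P_\theta$, $i = 1, \ldots, N$, and let $P_{\theta;N}(\omega) =
\frac{1}{N}\sum_{i=1}^N \mathbb{I}\{\omega_i = \omega\}$ denote the corresponding empirical distribution. Also, let the sample risk
envelope $\mathcal{U}(P_{\theta;N})$ be defined according to Eq. (2) with $P_\theta$ replaced by $P_{\theta;N}$. Consider the following
SAA version of the optimization in Eq. (1):

$$\rho_N(Z) = \max_{\xi \,:\, \xi P_{\theta;N} \in \mathcal{U}(P_{\theta;N})} \ \sum_{i \in 1, \ldots, N} P_{\theta;N}(\omega_i)\xi(\omega_i)Z(\omega_i). \quad (7)$$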

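Note that (7) defines a convex optimization problem with $O(N)$ variables and constraints. In
the following, we assume that a solution to (7) may be computed efficiently using standard
convex programming tools such as interior point methods [?]. Let $\xi^*_{\theta;N}$ denote a solution to (7) and
$\lambda^{*,P}_{\theta;N}, \lambda^{*,E}_{\theta;N}, \lambda^{*,I}_{\theta;N}$ denote the corresponding KKT multipliers, which can be obtained from the
convex programming algorithm [?]. We propose the following estimator for the gradient, based on
Theorem 4.2:

$$\nabla_{\theta;N}\rho(Z) = \sum_{i=1}^N P_{\theta;N}(\omega_i)\xi^*_{\theta;N}(\omega_i)\nabla_\theta \log P(\omega_i)\big(Z(\omega_i) - \lambda^{*,P}_{\theta;N}\big) - \sum_{e \in \mathcal{E}} \lambda^{*,E}_{\theta;N}(e)\, \nabla_\theta g_e(\xi^*_{\theta;N}; P_{\theta;N}) - \sum_{i \in \mathcal{I}} \lambda^{*,I}_{\theta;N}(i)\, \nabla_\theta f_i(\xi^*_{\theta;N}; P_{\theta;N}). \quad (8)$$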


Thus, our gradient estimation algorithm is a two-step procedure involving both sampling and convex
programming. In the following, we show that under some conditions on the set $\mathcal{U}(P_\theta)$, $\nabla_{\theta;N}\rho(Z)$
is a consistent estimator of $\nabla_\theta \rho(Z)$. The proof is given in the supplementary material.
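
The two-step procedure can be sketched in Python for the special case of the CVaR envelope, where the SAA problem (7) is a linear program that a greedy fill solves exactly, so no external solver is needed. This is only an illustration under stated assumptions: $P_\theta$ is taken to be a softmax over a few discrete outcomes, the multiplier is read off as the cost of the marginal sample, and all names (`saa_cvar_gradient`, `cost`, `n_samples`, `seed`) are hypothetical. A general envelope would require a convex programming solver that also returns the KKT multipliers.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [v / s for v in e]


def saa_cvar_gradient(theta, cost, alpha, n_samples, seed=0):
    p = softmax(theta)
    rng = random.Random(seed)
    # Step 1a: sample outcomes omega_i ~ P_theta.
    samples = rng.choices(range(len(theta)), weights=p, k=n_samples)
    # Step 1b: solve the SAA problem (7). Each sample carries empirical mass 1/N,
    # so its reweighted mass P_N * xi is capped at 1 / (alpha * N).
    cap = 1.0 / (alpha * n_samples)
    order = sorted(range(n_samples), key=lambda i: -cost[samples[i]])
    weight = [0.0] * n_samples
    mass, lam = 0.0, cost[samples[order[-1]]]
    for i in order:
        w = min(cap, 1.0 - mass)
        if w < 1e-9:  # envelope is full (up to floating-point residue)
            break
        weight[i] = w
        mass += w
        lam = cost[samples[i]]  # multiplier: cost of the marginal sample
    # Step 2: plug the solution into the estimator (8). The CVaR constraints do
    # not depend on P_theta, so the constraint-gradient terms vanish.
    grad = [0.0] * len(theta)
    for i in range(n_samples):
        if weight[i] == 0.0:
            continue
        for k in range(len(theta)):
            score = (1.0 if samples[i] == k else 0.0) - p[k]  # softmax score
            grad[k] += weight[i] * score * (cost[samples[i]] - lam)
    return grad


theta = [0.1, -0.3, 0.5, 0.2]  # hypothetical policy parameters
cost = [1.0, 4.0, 2.0, 8.0]    # hypothetical cost of each outcome
print(saa_cvar_gradient(theta, cost, alpha=0.3, n_samples=20000))
```

In this toy problem the exact gradient is available in closed form, $(4/\alpha)\,P_\theta(3)\,(\mathbb{1}\{k = 3\} - P_\theta(k))$, which gives a convenient reference for checking how the estimate behaves as `n_samples` grows.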


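Proposition 4.4. Let Assumptions 2.2 and 4.1 hold. Suppose there exists a compact set $C = C_\xi \times C_\lambda$
such that: (I) The set of Lagrangian saddle points $\mathcal{S} \subset C$ is non-empty and bounded. (II) The
functions $g_e(\xi, P_\theta)$ for all $e \in \mathcal{E}$ and $f_i(\xi, P_\theta)$ for all $i \in \mathcal{I}$ are finite-valued and continuous (in $\xi$)
on $C_\xi$. (III) For $N$ large enough, the set $\mathcal{S}_N$ is non-empty and $\mathcal{S}_N \subset C$ w.p. 1. Further assume that:
(IV) If $\xi_N P_{\theta;N} \in \mathcal{U}(P_{\theta;N})$ and $\xi_N$ converges w.p. 1 to a point $\xi$, then $\xi P_\theta \in \mathcal{U}(P_\theta)$. We then have
that $\lim_{N \to \infty} \rho_N(Z) = \rho(Z)$ and $\lim_{N \to \infty} \nabla_{\theta;N}\rho(Z) = \nabla_\theta \rho(Z)$ w.p. 1.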


The set of assumptions for Proposition 4.4 is large, but rather mild. Note that (I) is implied by
the Slater condition of Assumption 2.2. For satisfying (III), we need that the risk be well-defined
for every empirical distribution, which is a natural requirement. Since $P_{\theta;N}$ always converges to $P_\theta$
uniformly on $\Omega$, (IV) essentially requires smoothness of the constraints. We remark that, in particular,
conditions (I) to (IV) are satisfied for the popular CVaR, mean-semideviation, and spectral risk
measures.


To summarize this section, we have seen that by exploiting the special structure of coherent risk
measures in Theorem 2.1 and by the envelope-theorem style result of Theorem 4.2, we were able to
derive sampling-based, likelihood-ratio style algorithms for estimating the policy gradient $\nabla_\theta \rho(Z)$
of coherent static risk measures. The gradient estimation algorithms developed here for static risk
measures will be used as a sub-routine in our subsequent treatment of dynamic risk measures.


5 Gradient Formula for Dynamic Risk 


In this section, we derive a new formula for the gradient of the Markov coherent dynamic risk
measure, $\nabla_\theta \rho_\infty(\mathcal{M})$. Our approach is based on combining the static gradient formula of Theorem 4.2
with a dynamic-programming decomposition of $\rho_\infty(\mathcal{M})$.


The risk-sensitive value-function for an MDP $\mathcal{M}$ under the policy $\theta$ is defined as $V_\theta(x) =
\rho_\infty(\mathcal{M}|x_0 = x)$, where with a slight abuse of notation, $\rho_\infty(\mathcal{M}|x_0 = x)$ denotes the
Markov-coherent dynamic risk in (3) when the initial state $x_0$ is $x$. It is shown in [30] that due to the structure
of the Markov dynamic risk $\rho_\infty(\mathcal{M})$, the value function is the unique solution to the risk-sensitive
Bellman equation

$$V_\theta(x) = C(x) + \gamma \max_{\xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \mathbb{E}_\xi[V_\theta(x')], \quad (9)$$

where the expectation is taken over the next state transition. Note that by definition, we have
$\rho_\infty(\mathcal{M}) = V_\theta(x_0)$, and thus, $\nabla_\theta \rho_\infty(\mathcal{M}) = \nabla_\theta V_\theta(x_0)$.
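
For a small MDP, Eq. (9) can be solved by plain fixed-point iteration, since a coherent risk is non-expansive in the sup norm and the discount $\gamma < 1$ then makes the update a contraction. The sketch below is a hedged illustration that uses the CVaR envelope at every state; the tabular layout of `P` and `mu`, the function names, and the toy MDP are assumptions for illustration and do not reproduce the paper's critic, which relies on function approximation.

```python
def cvar(p, z, alpha):
    """Max of the xi-weighted expectation over the CVaR envelope."""
    rho, mass = 0.0, 0.0
    for i in sorted(range(len(z)), key=lambda k: -z[k]):
        w = min(p[i] / alpha, 1.0 - mass)
        if w > 0.0:
            rho += w * z[i]
            mass += w
    return rho


def risk_sensitive_value(C, P, mu, gamma, alpha, tol=1e-10):
    """Iterate the risk-sensitive Bellman equation (9) to a fixed point.

    P[x][a][y] = P(y | x, a) and mu[x][a] = mu_theta(a | x).
    """
    n = len(C)
    # Policy-induced transitions: P_theta(y | x) = sum_a P(y | x, a) mu_theta(a | x)
    P_theta = [[sum(mu[x][a] * P[x][a][y] for a in range(len(mu[x])))
                for y in range(n)] for x in range(n)]
    V = [0.0] * n
    while True:
        V_new = [C[x] + gamma * cvar(P_theta[x], V, alpha) for x in range(n)]
        if max(abs(a - b) for a, b in zip(V, V_new)) < tol:
            return V_new
        V = V_new


# Hypothetical two-state MDP: in state 0, action 0 stays and action 1 moves
# to the costly absorbing state 1.
C = [0.0, 1.0]
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
mu = [[0.5, 0.5], [0.5, 0.5]]
print(risk_sensitive_value(C, P, mu, gamma=0.5, alpha=1.0))
print(risk_sensitive_value(C, P, mu, gamma=0.5, alpha=0.5))
```

The first call is the risk-neutral value of the policy; the second yields a larger value at state 0 because the envelope shifts all transition mass onto the costly next state.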


We now develop a formula for $\nabla_\theta V_\theta(x)$; this formula extends the well-known “policy gradient
theorem” [?], [?], developed for the expected return, to Markov-coherent dynamic risk measures.
We make a standard assumption, analogous to Assumption 4.1 of the static case.

Assumption 5.1. The likelihood ratio $\nabla_\theta \log \mu_\theta(a|x)$ is well-defined and bounded for all $x \in \mathcal{X}$
and $a \in \mathcal{A}$.

For each state $x \in \mathcal{X}$, let $(\xi^*_{\theta,x}, \lambda^{*,P}_{\theta,x}, \lambda^{*,E}_{\theta,x}, \lambda^{*,I}_{\theta,x})$ denote a saddle point of (6), corresponding to the
state $x$, with $P_\theta(\cdot|x)$ replacing $P_\theta$ in (6) and $V_\theta$ replacing $Z$. The next theorem presents a formula
for $\nabla_\theta V_\theta(x)$; the proof is in the supplementary material.

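Theorem 5.2. Under Assumptions 2.2 and 5.1, we have

$$\nabla_\theta V_\theta(x) = \mathbb{E}_{\xi^*_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t \nabla_\theta \log \mu_\theta(a_t|x_t)\, h_\theta(x_t, a_t) \ \Big|\ x_0 = x\Big],$$

where $\mathbb{E}_{\xi^*_\theta}[\cdot]$ denotes the expectation w.r.t. trajectories generated by the Markov chain with transition
probabilities $P_\theta(\cdot|x)\xi^*_{\theta,x}(\cdot)$, and the stage-wise cost function $h_\theta(x, a)$ is defined as

$$h_\theta(x, a) = C(x) + \sum_{x' \in \mathcal{X}} P(x'|x, a)\,\xi^*_{\theta,x}(x')\Big[\gamma V_\theta(x') - \lambda^{*,P}_{\theta,x} - \sum_{i \in \mathcal{I}} \lambda^{*,I}_{\theta,x}(i)\frac{d f_i(\xi^*_{\theta,x}, p)}{d p(x')} - \sum_{e \in \mathcal{E}} \lambda^{*,E}_{\theta,x}(e)\frac{d g_e(\xi^*_{\theta,x}, p)}{d p(x')}\Big].$$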


Theorem 5.2 may be used to develop an actor-critic style [?] sampling-based algorithm for
solving the DRP problem (5), composed of two interleaved procedures:

Critic: For a given policy $\theta$, calculate the risk-sensitive value function $V_\theta$, and
Actor: Using the critic's $V_\theta$ and Theorem 5.2, estimate $\nabla_\theta \rho_\infty(\mathcal{M})$ and update $\theta$.


Space limitation restricts us from specifying the full details of our actor-critic algorithm and its 
analysis. In the following, we highlight only the key ideas and results. For the full details, we refer 
the reader to the full paper version, provided in the supplementary material. 


For the critic, the main challenge is calculating the value function when the state space X is large 
and dynamic programming cannot be applied due to the ‘curse of dimensionality’. To overcome 
this, we exploit the fact that $V_\theta$ is equivalent to the value function in a robust MDP [24] and modify
a recent algorithm in [37] to estimate it using function approximation.


For the actor, the main challenge is that in order to estimate the gradient using Thm. 5.2 we need to
sample from an MDP with $\xi^*_\theta$-weighted transitions. Also, $h_\theta(x, a)$ involves an expectation for each
$x$ and $a$. Therefore, we propose a two-phase sampling procedure to estimate $\nabla_\theta V_\theta(x)$ in which we first
use the critic's estimate of $V_\theta$ to derive $\xi^*_\theta$, and sample a trajectory from an MDP with $\xi^*_\theta$-weighted
transitions. For each state in the trajectory, we then sample several next states to estimate $h_\theta(x, a)$.

Figure 1: Numerical illustration - selection between 3 assets. A: Probability density of asset return.
B, C, D: Bar plots of the probability of selecting each asset vs. training iterations, for policies $\pi_1$, $\pi_2$,
and $\pi_3$, respectively. At each iteration, 10,000 samples were used for gradient estimation.

The convergence analysis of the actor-critic algorithm and the gradient error incurred from function 
approximation of Vg are reported in the supplementary material. 

6 Numerical Illustration 

In this section, we illustrate our approach with a numerical example. The purpose of this illustration
is to emphasize the importance of flexibility in designing risk criteria, so that one can select a
risk measure that suits both the user's risk preference and the problem-specific properties.

We consider a trading agent that can invest in one of three assets (see Figure 1 for their distributions).
The returns of the first two assets, A1 and A2, are normally distributed: $A1 \sim \mathcal{N}(1, 1)$ and $A2 \sim
\mathcal{N}(4, 6)$. The return of the third asset A3 has a Pareto distribution: $f(z) = \frac{\alpha}{z^{\alpha+1}}$ for all $z > 1$, with $\alpha =
1.5$. The mean of the return from A3 is 3 and its variance is infinite; such heavy-tailed distributions
are widely used in financial modeling [?]. The agent selects an action randomly, with probability
$P(A_i) \propto \exp(\theta_i)$, where $\theta \in \mathbb{R}^3$ is the policy parameter. We trained three different policies $\pi_1$, $\pi_2$,
and $\pi_3$. Policy $\pi_1$ is risk-neutral, i.e., $\max_\theta \mathbb{E}[Z]$, and it was trained using standard policy gradient
[?]. Policy $\pi_2$ is risk-averse and had a mean-semideviation objective $\max_\theta \mathbb{E}[Z] - \text{SD}[Z]$, and
was trained using the algorithm in Section 4. Policy $\pi_3$ is also risk-averse, with a
mean-standard-deviation objective, as proposed in [?], [?], $\max_\theta \mathbb{E}[Z] - \sqrt{\text{Var}[Z]}$, and was trained using the
algorithm of [?]. For each of these policies, Figure 1 shows the probability of selecting each asset
vs. training iterations. Although A2 has the highest mean return, the risk-averse policy $\pi_2$ chooses
A3, since it has a lower downside, as expected. However, because of the heavy upper-tail of A3,
policy $\pi_3$ opted to choose A1 instead. This is counter-intuitive as a rational investor should not avert
high returns. In fact, in this case A3 stochastically dominates A1 [?].
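
The experimental setup can be mimicked with a few lines of Python. The sketch below covers only the softmax policy over the three assets and a standard likelihood-ratio estimate of the risk-neutral gradient, as used for $\pi_1$; it is not the authors' code. It assumes that $\mathcal{N}(4, 6)$ denotes mean 4 and variance 6, uses a sample-mean baseline (a common variance-reduction choice that is not taken from the paper), and all function names are hypothetical.

```python
import math
import random


def softmax(theta):
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    s = sum(e)
    return [v / s for v in e]


def sample_return(asset, rng):
    if asset == 0:
        return rng.gauss(1.0, 1.0)             # A1 ~ N(1, 1)
    if asset == 1:
        return rng.gauss(4.0, math.sqrt(6.0))  # A2 ~ N(4, 6), 6 read as variance
    return rng.paretovariate(1.5)              # A3: Pareto on z > 1, mean 3


def risk_neutral_gradient(theta, n_samples, rng):
    """Likelihood-ratio estimate of the gradient of E[Z] w.r.t. theta."""
    p = softmax(theta)
    draws = []
    for _ in range(n_samples):
        a = rng.choices(range(3), weights=p)[0]
        draws.append((a, sample_return(a, rng)))
    baseline = sum(r for _, r in draws) / n_samples
    grad = [0.0, 0.0, 0.0]
    for a, r in draws:
        for k in range(3):
            score = (1.0 if a == k else 0.0) - p[k]  # d/dtheta_k log P(A_a)
            grad[k] += score * (r - baseline) / n_samples
    return grad


rng = random.Random(0)
print(risk_neutral_gradient([0.0, 0.0, 0.0], 20000, rng))
```

Because the experiment maximizes return, such a gradient would be used in an ascent step on $\theta$. The component for A1 is negative under a uniform policy, reflecting its low mean, while the heavy tail of A3 makes the other components noisier.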

7 Conclusion 

We presented algorithms for estimating the gradient of both static and dynamic coherent risk mea¬ 
sures using two new policy gradient style formulas that combine sampling with convex program¬ 
ming. Thereby, our approach extends risk-sensitive RL to the whole class of coherent risk measures, 
and generalizes several recent studies that focused on specific risk measures. 

On the technical side, an important future direction is to improve the convergence rate of gradient 
estimates using importance sampling methods. This is especially important for risk criteria that are 
sensitive to rare events, such as the CVaR [?].

From a more conceptual point of view, the coherent-risk framework explored in this work provides
the decision maker with flexibility in designing risk preference. As our numerical example shows,
such flexibility is important for selecting appropriate problem-specific risk measures for managing
the cost variability. However, we believe that our approach has much more potential than that.

In almost every real-world application, uncertainty emanates from stochastic dynamics, but also,
and perhaps more importantly, from modeling errors (model uncertainty). A prudent policy should
protect against both types of uncertainties. The representation duality of coherent risk (Theorem
2.1) naturally relates the risk to model uncertainty. In [24], a similar connection was made between
model uncertainty in MDPs and dynamic Markov coherent risk. We believe that by carefully shaping
the risk criterion, the decision maker may be able to take uncertainty into account in a broad
sense. Designing a principled procedure for such risk-shaping is not trivial, and is beyond the scope
of this paper. However, we believe that there is much potential to risk shaping as it may be the key
for handling model misspecification in dynamic decision making.
A Proof of Theorem 4.2

First note from Assumption 2.2 that

(i) Slater's condition holds in the primal optimization problem (1),

(ii) $L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$ is concave in $\xi$ and convex (in fact, linear) in $(\lambda^P, \lambda^E, \lambda^I)$.


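Thus by the duality result in convex optimization [?], the above conditions imply
strong duality and we have $\rho(Z) = \max_{\xi \ge 0} \min_{\lambda^P, \lambda^I \ge 0, \lambda^E} L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I) =
\min_{\lambda^P, \lambda^I \ge 0, \lambda^E} \max_{\xi \ge 0} L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$. From Assumption 2.2, one can also see that the
family of functions $\{L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)\}$, indexed by $(\xi, \lambda^P, \lambda^E, \lambda^I)$, is equi-differentiable in
$\theta$, and that $L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$ is Lipschitz and, as a result, an absolutely continuous function in $\theta$; thus,
$\nabla_\theta L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)$ is continuous and bounded at each $(\xi, \lambda^P, \lambda^E, \lambda^I)$. Then for every selection
of saddle point $(\xi^*_\theta, \lambda^{*,P}_\theta, \lambda^{*,E}_\theta, \lambda^{*,I}_\theta) \in \mathcal{S}$ of (6), using the Envelope theorem for saddle-point
problems (see Theorem 4 of [21]), we have

$$\nabla_\theta \max_{\xi \ge 0} \min_{\lambda^P, \lambda^I \ge 0, \lambda^E} L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I) = \nabla_\theta L_\theta(\xi, \lambda^P, \lambda^E, \lambda^I)\Big|_{(\xi^*_\theta, \lambda^{*,P}_\theta, \lambda^{*,E}_\theta, \lambda^{*,I}_\theta)}. \quad (10)$$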


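The result follows by writing the gradient in (10) explicitly, and using the likelihood-ratio trick:

$$\sum_{\omega \in \Omega} \xi(\omega)\nabla_\theta P_\theta(\omega)Z(\omega) - \lambda^P \sum_{\omega \in \Omega} \xi(\omega)\nabla_\theta P_\theta(\omega) = \sum_{\omega \in \Omega} \xi(\omega)P_\theta(\omega)\nabla_\theta \log P(\omega)\big(Z(\omega) - \lambda^P\big),$$

where the last equality is justified by Assumption 4.1.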


B Gradient Results for Static Mean-Semideviation 


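In this section we consider the mean-semideviation risk measure, defined as follows:
\[
\rho_{\mathrm{MSD}}(Z) = \mathbb{E}[Z] + c \left( \mathbb{E}\left[ (Z - \mathbb{E}[Z])_+^2 \right] \right)^{1/2}. \tag{11}
\]
Following the derivation in [32], note that $(\mathbb{E}[|Z|^2])^{1/2} = \|Z\|_2$, where $\|\cdot\|_2$ denotes the $L_2$ norm of the space $\mathcal{L}_2(\Omega, \mathcal{F}, P_\theta)$. The norm may also be written as:
\[
\|Z\|_2 = \sup_{\|\xi\|_2 \le 1} \langle \xi, Z \rangle,
\]
and hence
\[
\left( \mathbb{E}\left[ (Z - \mathbb{E}[Z])_+^2 \right] \right)^{1/2}
= \sup_{\|\xi\|_2 \le 1} \langle \xi, (Z - \mathbb{E}[Z])_+ \rangle
= \sup_{\|\xi\|_2 \le 1,\ \xi \ge 0} \langle \xi, Z - \mathbb{E}[Z] \rangle
= \sup_{\|\xi\|_2 \le 1,\ \xi \ge 0} \langle \xi - \mathbb{E}[\xi], Z \rangle.
\]
It follows that the dual representation of $\rho_{\mathrm{MSD}}$ holds with the risk envelope
\[
\mathcal{U} = \left\{ \xi' \in \mathcal{Z}^* : \xi' = 1 + c\xi - c\,\mathbb{E}[\xi],\ \|\xi\|_2 \le 1,\ \xi \ge 0 \right\}.
\]
For this case it will be more convenient to write this representation in the following form:
\[
\rho_{\mathrm{MSD}}(Z) = \sup_{\|\xi\|_2 \le 1,\ \xi \ge 0} \langle 1 + c\xi - c\,\mathbb{E}[\xi], Z \rangle. \tag{12}
\]
Let $\bar{\xi}$ denote an optimal solution for (12). In [32] it is shown that $\bar{\xi}$ is a contact point of $(Z - \mathbb{E}[Z])_+$, that is
\[
\bar{\xi} \in \arg\max \left\{ \langle \xi, (Z - \mathbb{E}[Z])_+ \rangle : \|\xi\|_2 \le 1 \right\},
\]
and we have that
\[
\bar{\xi} = \frac{(Z - \mathbb{E}[Z])_+}{\|(Z - \mathbb{E}[Z])_+\|_2} = \frac{(Z - \mathbb{E}[Z])_+}{\mathrm{SD}(Z)}. \tag{13}
\]
Note that $\bar{\xi}$ is not necessarily a probability distribution, but for $c \in [0, 1]$, it can be shown [32] that $1 + c\bar{\xi} - c\,\mathbb{E}[\bar{\xi}]$ always is.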
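The representation (12) with the contact point (13) can be verified on a small discrete example. The sketch below assumes a finite outcome space with probability vector `p` and costs `z`; the function names and the sample numbers are hypothetical and only illustrate the formulas above.

```python
import numpy as np

def msd(p, z, c):
    """Mean-semideviation (11) of a discrete cost z with probabilities p."""
    m = p @ z
    sd = np.sqrt(p @ np.maximum(z - m, 0.0) ** 2)
    return m + c * sd

def msd_density(p, z, c):
    """Reweighting 1 + c*xi_bar - c*E[xi_bar], with xi_bar the contact point (13).

    Assumes SD(Z) > 0, i.e. the cost is not almost surely constant.
    """
    m = p @ z
    plus = np.maximum(z - m, 0.0)
    sd = np.sqrt(p @ plus ** 2)
    xi_bar = plus / sd
    return 1.0 + c * xi_bar - c * (p @ xi_bar)

# Hypothetical three-outcome example.
p = np.array([0.2, 0.5, 0.3])
z = np.array([1.0, 2.0, 5.0])
c = 0.7
xi_prime = msd_density(p, z, c)
```

Here `p @ (xi_prime * z)` reproduces `msd(p, z, c)`, and for $c \in [0, 1]$ the vector `xi_prime * p` is a probability distribution, in line with the remark above.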


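In the following we show that $\bar{\xi}$ may be used to write the gradient $\nabla_\theta \rho_{\mathrm{MSD}}(Z)$ as an expectation,
which will lead to a sampling algorithm for the gradient.

Proposition B.1. Under Assumption 4.1 we have that
\[
\nabla_\theta \rho_{\mathrm{MSD}}(Z) = \nabla_\theta \mathbb{E}[Z] + \frac{c}{\mathrm{SD}(Z)}\, \mathbb{E}\left[ (Z - \mathbb{E}[Z])_+ \left( \nabla_\theta \log P(\omega)(Z - \mathbb{E}[Z]) - \nabla_\theta \mathbb{E}[Z] \right) \right],
\]
and, according to the standard likelihood-ratio method,
\[
\nabla_\theta \mathbb{E}[Z] = \mathbb{E}\left[ \nabla_\theta \log P(\omega) Z \right].
\]

Proof. Note that in Eq. (12) the constraints do not depend on $\theta$. Therefore, using the envelope
theorem we obtain that
\[
\nabla_\theta \rho(Z) = \nabla_\theta \langle 1 + c\bar{\xi} - c\,\mathbb{E}[\bar{\xi}], Z \rangle
= \nabla_\theta \langle 1, Z \rangle + c \nabla_\theta \langle \bar{\xi}, Z \rangle - c \nabla_\theta \langle \mathbb{E}[\bar{\xi}], Z \rangle. \tag{14}
\]
We now write each of the terms in Eq. (14) as an expectation. We start with the following standard
likelihood-ratio result:
\[
\nabla_\theta \langle 1, Z \rangle = \nabla_\theta \mathbb{E}[Z] = \mathbb{E}\left[ \nabla_\theta \log P(\omega) Z \right].
\]
Also, we have that
\[
\langle \mathbb{E}[\bar{\xi}], Z \rangle = \mathbb{E}[\bar{\xi}]\, \mathbb{E}[Z],
\]
therefore, by the product rule:
\[
\nabla_\theta \langle \mathbb{E}[\bar{\xi}], Z \rangle = \nabla_\theta \mathbb{E}[\bar{\xi}]\, \mathbb{E}[Z] + \mathbb{E}[\bar{\xi}]\, \nabla_\theta \mathbb{E}[Z].
\]
By the likelihood-ratio trick and Eq. (13) we have that
\[
\nabla_\theta \mathbb{E}[\bar{\xi}] = \frac{1}{\mathrm{SD}(Z)}\, \mathbb{E}\left[ \nabla_\theta \log P(\omega) (Z - \mathbb{E}[Z])_+ \right].
\]
Also, by the likelihood-ratio trick
\[
\nabla_\theta \mathbb{E}[\bar{\xi} Z] = \mathbb{E}\left[ \nabla_\theta \log P(\omega)\, \bar{\xi} Z \right].
\]
Plugging these terms back in Eq. (14), we have that
\begin{align*}
\nabla_\theta \rho(Z) &= \nabla_\theta \mathbb{E}[Z] + c \nabla_\theta \mathbb{E}[\bar{\xi} Z] - c \nabla_\theta \mathbb{E}[\bar{\xi}]\, \mathbb{E}[Z] - c\, \mathbb{E}[\bar{\xi}]\, \nabla_\theta \mathbb{E}[Z] \\
&= \nabla_\theta \mathbb{E}[Z] + c\, \mathbb{E}\left[ \bar{\xi} \left( \nabla_\theta \log P(\omega) Z - \nabla_\theta \mathbb{E}[Z] \right) \right] - c \nabla_\theta \mathbb{E}[\bar{\xi}]\, \mathbb{E}[Z] \\
&= \nabla_\theta \mathbb{E}[Z] + \frac{c}{\mathrm{SD}(Z)}\, \mathbb{E}\left[ (Z - \mathbb{E}[Z])_+ \left( \nabla_\theta \log P(\omega) Z - \nabla_\theta \mathbb{E}[Z] \right) \right] - c \nabla_\theta \mathbb{E}[\bar{\xi}]\, \mathbb{E}[Z] \\
&= \nabla_\theta \mathbb{E}[Z] + \frac{c}{\mathrm{SD}(Z)}\, \mathbb{E}\left[ (Z - \mathbb{E}[Z])_+ \left( \nabla_\theta \log P(\omega)(Z - \mathbb{E}[Z]) - \nabla_\theta \mathbb{E}[Z] \right) \right].
\end{align*}
□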


Proposition 4.3 naturally leads to a sampling-based gradient estimation algorithm, which we term
GMSD (Gradient of Mean Semi-Deviation). The algorithm is described in Algorithm 1.
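The steps of Algorithm 1 can be transcribed directly into array code. The following is a minimal sketch, assuming the sampled costs `z` and the score vectors `score[i] = ∇_θ log P(z_i)` are supplied by the caller; the function name `gmsd` and its argument layout are hypothetical. It mirrors the estimator exactly as stated and has not been validated here against a finite-difference gradient of $\rho_{\mathrm{MSD}}$.

```python
import numpy as np

def gmsd(z, score, c):
    """Transcription of Algorithm 1 (GMSD).

    z:     (N,) i.i.d. sampled costs z_1, ..., z_N
    score: (N, d) rows grad_theta log P(z_i)
    c:     risk level
    Assumes the empirical semideviation is strictly positive.
    """
    z = np.asarray(z, dtype=float)
    score = np.asarray(score, dtype=float)
    n = z.shape[0]
    mean = z.mean()                                   # step 2: empirical E[Z]
    plus = np.maximum(z - mean, 0.0)                  # (z_i - E[Z])_+
    sd = np.sqrt(np.mean(plus ** 2))                  # step 3: empirical SD(Z)
    grad_mean = (score * z[:, None]).mean(axis=0)     # step 4: likelihood-ratio estimate
    inner = score * (z - mean)[:, None] - grad_mean   # per-sample bracket in step 5
    return grad_mean + c / (sd * n) * (plus[:, None] * inner).sum(axis=0)
```

For example, `gmsd([1.0, 3.0], [[0.0], [1.0]], 1.0)` evaluates the step-5 expression on two samples with a one-dimensional parameter.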

C Consistency Proof 

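Let $(\Omega_{SAA}, \mathcal{F}_{SAA}, P_{SAA})$ denote the probability space of the SAA functions (i.e., the randomness
due to sampling).

Let $L_{\theta;N}$ denote the Lagrangian of the SAA problem
\[
L_{\theta;N}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) = \sum_{\omega \in \Omega} \xi(\omega) P_{\theta;N}(\omega) Z(\omega) - \lambda^{\mathcal P} \left( \sum_{\omega \in \Omega} \xi(\omega) P_{\theta;N}(\omega) - 1 \right) - \sum_{e \in \mathcal E} \lambda^{\mathcal E}(e)\, g_e(\xi, P_{\theta;N}) - \sum_{i \in \mathcal I} \lambda^{\mathcal I}(i)\, f_i(\xi, P_{\theta;N}). \tag{15}
\]
Recall that $S \subset \mathbb{R}^{|\Omega|} \times \mathbb{R} \times \mathbb{R}^{|\mathcal E|} \times \mathbb{R}^{|\mathcal I|}_+$ denotes the set of saddle points of the true Lagrangian.
Let $S_N \subset \mathbb{R}^{|\Omega|} \times \mathbb{R} \times \mathbb{R}^{|\mathcal E|} \times \mathbb{R}^{|\mathcal I|}_+$ denote the set of SAA Lagrangian (15) saddle points.

Algorithm 1 GMSD
1: Given:
   • Risk level $c$
   • An i.i.d. sequence $z_1, \ldots, z_N \sim P_\theta$.
2: Set $\widehat{\mathbb{E}}[Z] = \frac{1}{N} \sum_{i=1}^N z_i$.
3: Set $\widehat{\mathrm{SD}}(Z) = \left( \frac{1}{N} \sum_{i=1}^N (z_i - \widehat{\mathbb{E}}[Z])_+^2 \right)^{1/2}$.
4: Set $\widehat{\nabla_\theta \mathbb{E}}[Z] = \frac{1}{N} \sum_{i=1}^N \nabla_\theta \log P(z_i)\, z_i$.
5: Return:
   $\widehat{\nabla_\theta \rho}(Z) = \widehat{\nabla_\theta \mathbb{E}}[Z] + \frac{c}{\widehat{\mathrm{SD}}(Z)\, N} \sum_{i=1}^N (z_i - \widehat{\mathbb{E}}[Z])_+ \left( \nabla_\theta \log P(z_i)(z_i - \widehat{\mathbb{E}}[Z]) - \widehat{\nabla_\theta \mathbb{E}}[Z] \right)$.

Suppose that there exists a compact set $C = C_\xi \times C_\lambda$, where $C_\xi \subset \mathbb{R}^{|\Omega|}$ and $C_\lambda \subset \mathbb{R} \times \mathbb{R}^{|\mathcal E|} \times \mathbb{R}^{|\mathcal I|}_+$,
such that: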

(i) The set of Lagrangian saddle points $S \subset C$ is non-empty and bounded.

(ii) The functions $g_e(\xi, P_\theta)$ for all $e \in \mathcal E$ and $f_i(\xi, P_\theta)$ for all $i \in \mathcal I$ are finite valued and continuous (in $\xi$) on $C_\xi$.

(iii) For $N$ large enough the set $S_N$ is non-empty and $S_N \subset C$ w.p. 1.

Recall from Assumption 2.2 that for each fixed $\xi \in \mathcal{B}$, both $f_i(\xi, p)$ and $g_e(\xi, p)$ are continuous in
$p$. Furthermore, by the S.L.L.N. of Markov chains, for each policy parameter, we have $P_{\theta;N} \to P_\theta$
w.p. 1. From the definition of the Lagrangian function and continuity of constraint functions, one
can easily see that for each $(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \in \mathbb{R}^{|\Omega|} \times \mathbb{R} \times \mathbb{R}^{|\mathcal E|} \times \mathbb{R}^{|\mathcal I|}_+$, $L_{\theta;N}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \to L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})$ w.p. 1. Denote with $D\{A, B\}$ the deviation of set $A$ from set $B$, i.e., $D\{A, B\} = \sup_{x \in A} \inf_{y \in B} \|x - y\|$. Further assume that:

(iv) If $\xi_N \in \mathcal{U}(P_{\theta;N})$ and $\xi_N$ converges w.p. 1 to a point $\xi$, then $\xi \in \mathcal{U}(P_\theta)$.

According to the discussion in Page 161 of [32], the Slater condition of Assumption 2.2 guarantees
the following condition:

(v) For some point $\xi \in \mathcal{P}$ there exists a sequence $\xi_N \in \mathcal{U}(P_{\theta;N})$ such that $\xi_N \to \xi$ w.p. 1,

and from Theorem 6.6 in [32], we know that both sets $\mathcal{U}(P_{\theta;N})$ and $\mathcal{U}(P_\theta)$ are convex and compact.
Furthermore, note that we have

(vi) The objective function of the primal problem is linear, finite valued and continuous in $\xi$ on $C_\xi$ (these conditions obviously hold for almost all $\omega \in \Omega$ in the integrand function $\xi(\omega) Z(\omega)$).

(vii) S.L.L.N. holds point-wise for any $\xi$.

From (i, iv, v, vi, vii), and under the same lines of proof as in Theorem 5.5 of [32], we have that
\[
\rho_N(Z) \to \rho(Z) \quad \text{w.p. 1 as } N \to \infty, \tag{16}
\]
\[
D\{\mathcal{P}_N, \mathcal{P}\} \to 0 \quad \text{w.p. 1 as } N \to \infty. \tag{17}
\]
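For intuition, the deviation $D\{A, B\}$ used in (17) can be computed exactly when both sets are finite. The following sketch assumes each set is given as an array whose rows are points and uses the Euclidean norm; the function name `deviation` is hypothetical.

```python
import numpy as np

def deviation(A, B):
    """D{A, B} = max over x in A of min over y in B of ||x - y||, for finite sets."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # Pairwise distances: entry (i, j) is ||A[i] - B[j]||.
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return dists.min(axis=1).max()
```

The deviation is not symmetric: it vanishes when $A \subset B$ even if $B$ contains points far from $A$, which is why $D\{S_N, S\} \to 0$ only states that limit points of $S_N$ lie in $S$.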

In part 1 and part 2 of the following proof, we show, by following similar derivations as
in Theorem 5.2, Theorem 5.3 and Theorem 5.4 of [32], that $L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) \to L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta)$ w.p. 1 and $D\{S_N, S\} \to 0$ w.p. 1 as $N \to \infty$. Based on the definition
of the deviation of sets, the limit point of any element in $S_N$ is also an element in $S$.

Assumptions (i) and (iii) imply that we can restrict our attention to the set $C$.


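Part 1 We first show that $L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N})$ converges to $L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta)$ w.p. 1 as $N \to \infty$.

For each fixed $(\lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \in C_\lambda$, the function $L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})$ is convex and continuous in $\xi$. Together with the point-wise S.L.L.N. property, Theorem 7.49 of [32] implies
that $L_{\theta;N}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) - L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \xrightarrow{e} 0$, where $\xrightarrow{e}$ denotes epi-convergence. Furthermore, since the objective and constraint functions are convex in $\xi$ and are finite valued on $C_\xi$, the set $\mathrm{dom}\, L_\theta(\cdot, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})$ has non-empty interior. It follows from Theorem
7.27 of [32] that epi-convergence of $L_{\theta;N}$ to $L_\theta$ implies uniform convergence on $C_\xi$, i.e.,
$\sup_{\xi \in C_\xi} |L_{\theta;N}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) - L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})| \le \epsilon$. On the other hand, for each fixed
$\xi \in C_\xi$, the function $L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})$ is linear and thus continuous in $(\lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})$, and
$\mathrm{dom}\, L_\theta(\xi, \cdot, \cdot, \cdot) = \mathbb{R} \times \mathbb{R}^{|\mathcal E|} \times \mathbb{R}^{|\mathcal I|}$ has non-empty interior. It follows from analogous arguments that
$\sup_{(\lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \in C_\lambda} |L_{\theta;N}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) - L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I})| \le \epsilon$. Combining these results implies
that for any $\epsilon > 0$ and a.e. $\omega_{SAA} \in \Omega_{SAA}$ there is a $N^*(\epsilon, \omega_{SAA})$ such that, for all $N > N^*(\epsilon, \omega_{SAA})$,
\[
\sup_{(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \in C} \left| L_{\theta;N}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) - L_\theta(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \right| \le \epsilon. \tag{18}
\]

Now, assume by contradiction that for some $N > N^*(\epsilon, \omega_{SAA})$ we have
$L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) - L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) > \epsilon$.
Then by definition of the saddle points
\begin{align*}
L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta)
&\ge L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) \\
&> L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) + \epsilon
\ge L_\theta(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) + \epsilon,
\end{align*}
contradicting (18).

Similarly, assuming by contradiction that
$L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) - L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) > \epsilon$ gives
\begin{align*}
L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N})
&\ge L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) \\
&> L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) + \epsilon
\ge L_{\theta;N}(\xi^*_\theta, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) + \epsilon,
\end{align*}
also contradicting (18).

It follows that
$\left| L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) - L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) \right| \le \epsilon$ for all $N > N^*(\epsilon, \omega_{SAA})$, and therefore
\[
\lim_{N \to \infty} L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) = L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta), \tag{19}
\]
w.p. 1.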


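Part 2 Let us now show that $D\{S_N, S\} \to 0$. We argue by contradiction. Suppose
that $D\{S_N, S\} \not\to 0$. Since $C$ is compact, we can assume that there exists a sequence
$(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) \in S_N$ that converges to a point $(\xi^*, \lambda^{*,\mathcal P}, \lambda^{*,\mathcal E}, \lambda^{*,\mathcal I}) \in C$ and
$(\xi^*, \lambda^{*,\mathcal P}, \lambda^{*,\mathcal E}, \lambda^{*,\mathcal I}) \notin S$. However, from (17) we must have that $\xi^* \in \mathcal{P}$. Therefore, we must have
that
\[
L_\theta(\xi^*, \lambda^{*,\mathcal P}, \lambda^{*,\mathcal E}, \lambda^{*,\mathcal I}) > L_\theta(\xi^*, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta),
\]
by definition of the saddle point set.

Now,
\begin{align*}
& L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) - L_\theta(\xi^*, \lambda^{*,\mathcal P}, \lambda^{*,\mathcal E}, \lambda^{*,\mathcal I}) \\
&\quad = L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) - L_\theta(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) \\
&\qquad + L_\theta(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N}) - L_\theta(\xi^*, \lambda^{*,\mathcal P}, \lambda^{*,\mathcal E}, \lambda^{*,\mathcal I}). \tag{20}
\end{align*}
The first term in the r.h.s. of (20) tends to zero, using the argument from (18), and the second
by continuity of $L_\theta$ guaranteed by (ii). We thus obtain that $L_{\theta;N}(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N})$ tends to
$L_\theta(\xi^*, \lambda^{*,\mathcal P}, \lambda^{*,\mathcal E}, \lambda^{*,\mathcal I}) > L_\theta(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta)$, which is a contradiction to (19).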


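Part 3 We now show the consistency of $\nabla_{\theta;N} \rho(Z)$.

Consider the SAA gradient estimate $\nabla_{\theta;N} \rho(Z)$. Since $\nabla_\theta \log P(\cdot)$ is bounded by Assumption 4.1, and $\nabla_\theta f_i(\cdot; P_\theta)$ and
$\nabla_\theta g_e(\cdot; P_\theta)$ are bounded by Assumption 2.2, and using our previous result $D\{S_N, S\} \to 0$, we
have that for a.e. $\omega_{SAA} \in \Omega_{SAA}$
\begin{align*}
\lim_{N \to \infty} \nabla_{\theta;N} \rho(Z) &= \sum_{\omega \in \Omega} P_\theta(\omega)\, \xi^*_\theta(\omega)\, \nabla_\theta \log P(\omega) \left( Z(\omega) - \lambda^{*,\mathcal P}_\theta \right) \\
&\quad - \sum_{e \in \mathcal E} \lambda^{*,\mathcal E}_\theta(e)\, \nabla_\theta g_e(\xi^*_\theta; P_\theta)
- \sum_{i \in \mathcal I} \lambda^{*,\mathcal I}_\theta(i)\, \nabla_\theta f_i(\xi^*_\theta; P_\theta) \\
&= \nabla_\theta \rho(Z),
\end{align*}
where the first equality is obtained from the envelope theorem (see Theorem 4.2),
with $(\xi^*_\theta, \lambda^{*,\mathcal P}_\theta, \lambda^{*,\mathcal E}_\theta, \lambda^{*,\mathcal I}_\theta) \in S_N \cap S$ the limit point of the converging sequence
$\{(\xi^*_{\theta;N}, \lambda^{*,\mathcal P}_{\theta;N}, \lambda^{*,\mathcal E}_{\theta;N}, \lambda^{*,\mathcal I}_{\theta;N})\}_{N \in \mathbb{N}}$.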

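D Proof of Theorem 5.2

Similar to the proof of Theorem 4.2, recall the saddle point definition of $(\xi^*_{\theta,x}, \lambda^{*,\mathcal P}_{\theta,x}, \lambda^{*,\mathcal E}_{\theta,x}, \lambda^{*,\mathcal I}_{\theta,x}) \in S$
and the strong duality result, i.e.,
\begin{align*}
\max_{\xi\, :\, \xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \sum_{x' \in X} \xi(x') P_\theta(x'|x) V_\theta(x')
&= \max_{\xi \ge 0}\ \min_{\lambda^{\mathcal P},\, \lambda^{\mathcal I} \ge 0,\, \lambda^{\mathcal E}} L_{\theta,x}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}) \\
&= \min_{\lambda^{\mathcal P},\, \lambda^{\mathcal I} \ge 0,\, \lambda^{\mathcal E}}\ \max_{\xi \ge 0} L_{\theta,x}(\xi, \lambda^{\mathcal P}, \lambda^{\mathcal E}, \lambda^{\mathcal I}).
\end{align*}
The gradient formula in (10) can then be written as
\begin{align*}
\nabla_\theta V_\theta(x) &= \nabla_\theta \left[ C_\theta(x) + \gamma \max_{\xi\, :\, \xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \mathbb{E}_\xi[V_\theta] \right] \\
&= \gamma \sum_{x' \in X} \xi^*_{\theta,x}(x') P_\theta(x'|x) \nabla_\theta V_\theta(x') + \sum_{a \in A} \mu_\theta(a|x) \nabla_\theta \log \mu_\theta(a|x)\, h_\theta(x, a),
\end{align*}
where the stage-wise cost function $h_\theta(x, a)$ is defined in (26). By defining $\hat{h}_\theta(x) = \sum_{a \in A} \mu_\theta(a|x) \nabla_\theta \log \mu_\theta(a|x)\, h_\theta(x, a)$ and unfolding the recursion, the above expression implies
\[
\nabla_\theta V_\theta(x_0) = \hat{h}_\theta(x_0) + \gamma \sum_{x_1 \in X} P_\theta(x_1|x_0)\, \xi^*_{\theta,x_0}(x_1) \left[ \hat{h}_\theta(x_1) + \gamma \sum_{x_2 \in X} P_\theta(x_2|x_1)\, \xi^*_{\theta,x_1}(x_2)\, \nabla_\theta V_\theta(x_2) \right].
\]
Now since $\nabla_\theta V_\theta$ is continuously differentiable with bounded derivatives, when $t \to \infty$, one
obtains $\gamma^t \nabla_\theta V_\theta(x) \to 0$ for any $x \in X$. Therefore, by the Bounded Convergence Theorem,
$\lim_{t \to \infty} \mathbb{E}_{\xi^*_\theta}\left[ \gamma^t \nabla_\theta V_\theta(x_t) \right] = 0$, and when $x_0 = x$ the above expression implies the result of this theorem.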


E Gradient Formula for Dynamic Risk - Full Results 

In this section, we first derive a new formula for the gradient of a general Markov-coherent dynamic
risk measure, $\nabla_\theta \rho_\infty(\mathcal{M})$, that involves the value function of the risk objective $\rho_\infty(\mathcal{M})$ (e.g., the value
function proposed by [30]). This formula extends the well-known "policy gradient theorem" [34, 17]
developed for the expected return to Markov-coherent dynamic risk measures. Using this formula,
we suggest the following actor-critic style algorithm for estimating $\nabla_\theta \rho_\infty(\mathcal{M})$:

Critic: For a given policy $\theta$, calculate the risk-sensitive value function of $\rho_\infty(\mathcal{M})$ (see Section E.3), and

Actor: Using the critic's value function, estimate $\nabla_\theta \rho_\infty(\mathcal{M})$ by sampling (see Section E.4).

The value function proposed by [30] assigns to each state a particular value that encodes the long-term
risk starting from that state. When the state space $X$ is large, calculating the value function
by dynamic programming (as suggested by [30]) becomes intractable due to the "curse of dimensionality".
For the risk-neutral case, a standard solution to this problem is to approximate the value
function by a set of state-dependent features, and use sampling to calculate the parameters of this
approximation. In particular, temporal difference (TD) learning methods are popular for
this purpose, and have recently been extended to robust MDPs by [37]. We use their (robust) TD
algorithm and show how our critic uses it to approximate the risk-sensitive value function. We then
discuss how the error introduced by this approximation affects the gradient estimate of the actor.


E.l Dynamic Risk 


We provide a multi-period generalization of the concepts presented in Section 2.1. Here we closely
follow the discussion in [30].

Consider a probability space $(\Omega, \mathcal{F}, P_\theta)$, a filtration $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_T \subset \mathcal{F}$, and an adapted
sequence of real-valued random variables $Z_t$, $t \in \{0, \ldots, T\}$. We assume that $\mathcal{F}_0 = \{\Omega, \emptyset\}$, i.e., $Z_0$
is deterministic. For each $t \in \{0, \ldots, T\}$, we denote by $\mathcal{Z}_t$ the space of random variables defined
over the probability space $(\Omega, \mathcal{F}_t, P_\theta)$, and also let $\mathcal{Z}_{t,T} := \mathcal{Z}_t \times \cdots \times \mathcal{Z}_T$ be a sequence of these
spaces. The sequence of random variables $Z_t$ can be interpreted as the stage-wise costs observed
along a trajectory generated by an MDP parameterized by a parameter $\theta$, i.e., $Z_{0,T} = (Z_0 = \gamma^0 C(x_0, a_0), \ldots, Z_T = \gamma^T C(x_T, a_T)) \in \mathcal{Z}_{0,T}$.

In particular, we are interested in the sequence of random variables induced by the trajectories from
a Markov decision process (MDP) parameterized by parameter $\theta$.

Explicitly, for any $t \ge 0$ and state dependent random variable $Z(x_{t+1}) \in \mathcal{Z}_{t+1}$, the risk evaluation
is given by
\[
\rho(Z(x_{t+1})) = \max_{\xi\, :\, \xi P_\theta(\cdot|x_t) \in \mathcal{U}(x_t, P_\theta(\cdot|x_t))} \mathbb{E}_\xi\left[ Z(x_{t+1}) \right], \tag{21}
\]
where we let $\mathcal{U}(x_t, P_\theta(\cdot|x_t))$ denote the risk envelope with $P_\theta$ replaced with $P_\theta(\cdot|x_t)$. The
Markovian assumption on the risk measure $\rho_t(\mathcal{M})$ allows us to optimize it using dynamic programming
techniques.


E.2 Risk-Sensitive Bellman Equation 

Our value-function estimation method is driven by a Bellman-style equation for Markov coherent
risks. Let $B(X)$ denote the space of real-valued bounded functions on $X$ and $C_\theta(x) = \sum_{a \in A} C(x, a) \mu_\theta(a|x)$ be the stage-wise cost function induced by policy $\mu_\theta$. We now define the
risk sensitive Bellman operator $T_\theta[V] : B(X) \to B(X)$ as
\[
T_\theta[V](x) := C_\theta(x) + \gamma \max_{\xi\, :\, \xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \mathbb{E}_\xi[V]. \tag{22}
\]
According to Theorem 1 in [30], the operator $T_\theta$ has a unique fixed point $V_\theta$, i.e., $T_\theta[V_\theta](x) = V_\theta(x)$, $\forall x \in X$, that is equal to the risk objective function induced by $\theta$, i.e., $V_\theta(x_0) = \rho_\infty(\mathcal{M})$.
However, when the state space $X$ is large, exact enumeration of the Bellman equation is intractable
due to the "curse of dimensionality". Next, we provide an iterative approach to approximate the risk
sensitive value function.
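To make the operator $T_\theta$ concrete, the sketch below iterates it on a small Markov chain. It assumes, purely for illustration, the CVaR risk envelope $\{\xi P : 0 \le \xi \le 1/\alpha,\ \sum_{x'} \xi(x') P(x') = 1\}$, for which the inner maximization has a greedy solution; the function names, the three-state chain and the parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def cvar_max_expectation(p, v, alpha):
    """max E_xi[v] over xi with 0 <= xi <= 1/alpha and sum_x' xi(x') p(x') = 1."""
    order = np.argsort(-v)                 # put mass on the largest values first
    budget, total = 1.0, 0.0
    for j in order:
        w = min(p[j] / alpha, budget)      # reweighted mass xi(j) * p(j)
        total += w * v[j]
        budget -= w
        if budget <= 0.0:
            break
    return total

def bellman(V, C, P, gamma, alpha):
    """One application of T[V](x) = C(x) + gamma * max_xi E_xi[V]."""
    return np.array([C[x] + gamma * cvar_max_expectation(P[x], V, alpha)
                     for x in range(len(C))])

def value_iteration(C, P, gamma, alpha, iters=500):
    """Fixed-point iteration; T is a gamma-contraction in the sup norm."""
    V = np.zeros(len(C))
    for _ in range(iters):
        V = bellman(V, C, P, gamma, alpha)
    return V

# Hypothetical chain induced by a fixed policy: costs C_theta and kernel P_theta.
C = np.array([1.0, 0.0, 2.0])
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
V_neutral = value_iteration(C, P, 0.9, 1.0)   # alpha = 1 recovers the expectation
V_risk = value_iteration(C, P, 0.9, 0.5)
```

With `alpha = 1` the envelope collapses to $\xi \equiv 1$ and the fixed point is the usual risk-neutral value function; smaller `alpha` yields a pointwise larger (more conservative) value.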
E.3 Value Function Approximation

Consider the linear approximation of the risk-sensitive value function $V_\theta(x) \approx v^\top \phi(x)$, where
$\phi(\cdot) \in \mathbb{R}^{\kappa_2}$ is the $\kappa_2$-dimensional state-dependent feature vector. Thus, the approximate value
function belongs to the low dimensional sub-space $\mathcal{V} = \{\Phi v \mid v \in \mathbb{R}^{\kappa_2}\}$, where $\Phi : X \to \mathbb{R}^{\kappa_2}$ is a
function mapping such that $\Phi(x) = \phi(x)$. The goal of our critic is to find a good approximation of
$V_\theta$ from simulated trajectories of the MDP. In order to have a well-defined approximation scheme,
we first impose the following standard assumption.

Assumption E.1. The mapping $\Phi$ has full column rank.

For a function $y : X \to \mathbb{R}$, we define its weighted (by $d$) $\ell_2$-norm as $\|y\|_d = \sqrt{\sum_{x'} d(x'|x)\, y(x')^2}$,
where $d$ is a distribution over $X$. Using this, we define $\Pi : \mathbb{R}^{|X|} \to \mathcal{V}$, the orthogonal projection from
$\mathbb{R}^{|X|}$ to $\mathcal{V}$, w.r.t. a norm weighted by the stationary distribution of the policy, $d_\theta(x'|x)$.
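For a finite state space, the projection $\Pi$ has the closed form $\Phi(\Phi^\top D \Phi)^{-1} \Phi^\top D$ with $D = \mathrm{diag}(d)$, which is well defined under Assumption E.1 when $d$ has full support. The following sketch illustrates this; the feature matrix, the weights and the function names are hypothetical.

```python
import numpy as np

def weighted_projection(Phi, d):
    """Matrix of the orthogonal projection onto span(Phi) in the d-weighted l2-norm.

    Phi: (|X|, kappa_2) feature matrix with full column rank (Assumption E.1).
    d:   (|X|,) strictly positive weights summing to one.
    """
    D = np.diag(d)
    return Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)

def weighted_norm(y, d):
    """||y||_d = sqrt(sum_x d(x) y(x)^2)."""
    return np.sqrt(d @ y ** 2)

# Hypothetical example: four states, two features (constant and linear).
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([0.1, 0.2, 0.3, 0.4])
Pi = weighted_projection(Phi, d)
y = np.array([1.0, -2.0, 0.5, 3.0])
residual = y - Pi @ y
```

The residual `y - Pi @ y` is orthogonal to every feature column in the $d$-weighted inner product, which is the orthogonality condition used to characterize the fixed point of $\Pi T_\theta$.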


Note that the TD methods approximate the value function $V_\theta$ with the fixed point of the joint operator
$\Pi T_\theta$, i.e., $\tilde{V}_\theta(x) = v^{*\top}_\theta \phi(x)$, such that
\[
\forall x \in X, \qquad \tilde{V}_\theta(x) = \Pi T_\theta[\tilde{V}_\theta](x). \tag{23}
\]

From Eq. 21, which has been derived from Theorem 2.1 for dynamic risks, it is easy to see that the risk-sensitive
Bellman equation (22) is a robust Bellman equation with uncertainty set $\mathcal{U}(x, P_\theta(\cdot|x))$.
Thus, we may use the TD approximation of the robust Bellman equation proposed by [37] to find an
approximation of $V_\theta$. We will need the following assumption, analogous to Assumption 2 in [37].

Assumption E.2. There exists $\kappa \in (0, 1)$ such that $\xi(x') \le \kappa / \gamma$, for all $\xi(\cdot) P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))$
and all $x, x' \in X$.

Given Assumption E.2, Proposition 3 in [37] guarantees that the projected risk-sensitive Bellman
operator $\Pi T_\theta$ is a contraction w.r.t. the $d_\theta$-norm. Therefore, Eq. 23 has a unique fixed-point solution
$\tilde{V}_\theta(x) = v^{*\top}_\theta \phi(x)$. This means that $v^*_\theta \in \mathbb{R}^{\kappa_2}$ satisfies $v^*_\theta \in \arg\min_v \|T_\theta[\Phi v^*_\theta] - \Phi v\|^2_{d_\theta}$. By the
projection theorem on Hilbert spaces, the orthogonality condition for $v^*_\theta$ becomes
\[
\sum_{x \in X} d_\theta(x|x_0)\, \phi(x) \phi(x)^\top v^*_\theta
= \sum_{x \in X} d_\theta(x|x_0)\, \phi(x)\, C_\theta(x)
+ \gamma \sum_{x \in X} d_\theta(x|x_0)\, \phi(x) \max_{\xi\, :\, \xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \mathbb{E}_\xi[\Phi v^*_\theta].
\]

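As a result, given a long enough trajectory $x_0, a_0, x_1, a_1, \ldots, x_{N-1}, a_{N-1}$ generated by policy $\theta$, we
may estimate the fixed-point solution $v^*_\theta$ using the projected risk sensitive value iteration (PRSVI)
algorithm with the update rule
\[
v_{k+1} = \left( \frac{1}{N} \sum_{t=0}^{N-1} \phi(x_t) \phi(x_t)^\top \right)^{-1}
\left[ \frac{1}{N} \sum_{t=0}^{N-1} \phi(x_t)\, C_\theta(x_t)
+ \gamma\, \frac{1}{N} \sum_{t=0}^{N-1} \phi(x_t) \max_{\xi\, :\, \xi P_\theta(\cdot|x_t) \in \mathcal{U}(x_t, P_\theta(\cdot|x_t))} \mathbb{E}_\xi[\Phi v_k] \right]. \tag{24}
\]

Note that using the law of large numbers, as both $N$ and $k$ tend to infinity, $v_k$ converges w.p. 1 to
$v^*_\theta$, the unique solution of the fixed point equation $\Pi T_\theta[\Phi v] = \Phi v$.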

In order to implement the iterative algorithm (24), one must repeatedly solve the inner optimization
problem $\max_{\xi\, :\, \xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \mathbb{E}_\xi[\Phi v]$. When the state space $X$ is large, solving this
optimization problem is often computationally expensive or even intractable. Similar to Section 3.4
of lEl, we propose the following SAA approach to solve this problem. Eor the trajectory, 
$x_0, a_0, x_1, a_1, \ldots, x_{N-1}, a_{N-1}$, we define the empirical transition probability*
\[
P_N(x'|x, a) = \frac{\sum_{t=0}^{N-1} \mathbf{1}\{x_t = x,\ a_t = a,\ x_{t+1} = x'\}}{\sum_{t=0}^{N-1} \mathbf{1}\{x_t = x,\ a_t = a\}}
\]
and $P_{\theta;N}(x'|x) = \sum_{a \in A} P_N(x'|x, a)\, \mu_\theta(a|x)$. Consider the following $\ell_2$-regularized empirical robust optimization problem†
\[
\rho_N(\Phi v) = \max_{\xi\, :\, \xi P_{\theta;N} \in \mathcal{U}(x, P_{\theta;N})} \sum_{x' \in X} \left[ P_{\theta;N}(x'|x)\, \xi(x')\, \phi^\top(x') v + \frac{1}{2N} \left( P_{\theta;N}(x'|x)\, \xi(x') \right)^2 \right]. \tag{25}
\]

*In the case when the sizes of state and action spaces are huge or when these spaces are continuous, the
empirical transition probability can be found by kernel density estimation.

†In the SAA approach, we only sum over the elements for which $P_{\theta;N}(x'|x) > 0$, thus, the sum has at most
$N$ elements.

As in EOl . the £ 2 -regularization term in this optimization problem guarantees convergence of opti¬ 
mizers £* and the corresponding KKT multipliers, when N ^ 00 . Convergence of these parameters 
is crucial for the policy gradient analysis in the next sections. We denote by Q,x-N’ the solution of 

the above empirical optimization problem, and by ^S,x-N^ ^e\x-N^ \',x-N^ the corresponding KKT 
multipliers. 
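The empirical transition probabilities $P_N(x'|x, a)$ and $P_{\theta;N}(x'|x)$ that enter problem (25) can be tabulated from a single trajectory. The sketch below assumes integer-coded states and actions and a tabular policy `mu[x, a]`; the function names and the convention of leaving unvisited state-action pairs as zero rows are assumptions made for illustration.

```python
import numpy as np

def empirical_transitions(states, actions, n_states, n_actions):
    """P_N[x, a, x'] from observed transitions (x_t, a_t, x_{t+1})."""
    counts = np.zeros((n_states, n_actions, n_states))
    for x, a, x_next in zip(states[:-1], actions, states[1:]):
        counts[x, a, x_next] += 1.0
    visits = counts.sum(axis=2, keepdims=True)
    # Unvisited (x, a) pairs are left as all-zero rows instead of dividing by zero.
    return np.divide(counts, visits, out=np.zeros_like(counts), where=visits > 0)

def policy_transitions(P_N, mu):
    """P_{theta;N}[x, x'] = sum_a P_N[x, a, x'] * mu[x, a]."""
    return np.einsum("xay,xa->xy", P_N, mu)

# Hypothetical trajectory on two states and two actions.
states = [0, 1, 0, 1, 1]
actions = [0, 0, 0, 1]
P_N = empirical_transitions(states, actions, 2, 2)
mu = np.array([[1.0, 0.0], [0.5, 0.5]])
P_theta_N = policy_transitions(P_N, mu)
```

Only successor states with $P_{\theta;N}(x'|x) > 0$ contribute to the sum in (25), so the inner problem involves at most $N$ variables per state.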


We obtain the empirical PRSVI algorithm by replacing the inner optimization in Eq. 24
with $\rho_N(\Phi v)$ from Eq. 25. Similarly, as both $N$ and $k$ tend to infinity, $v_k$ converges w.p. 1 to $v^*_\theta$.
More details can be found in the supplementary material.

E.4 Gradient Estimation 


In Section E.3 we showed that we may effectively approximate the value function of a fixed policy
$\theta$ using the (empirical) PRSVI algorithm in Eq. 24. In this section, we first derive a formula for the
gradient of the Markov-coherent dynamic risk measure $\rho_\infty(\mathcal{M})$, and then propose a SAA algorithm
for estimating this gradient, in which we use the SAA approximation of the value function from Section E.3.
As described in Section E.2, $\rho_\infty(\mathcal{M}) = V_\theta(x_0)$, and thus, we shall first derive a formula
for $\nabla_\theta V_\theta(x_0)$.

Let $(\xi^*_{\theta,x}, \lambda^{*,\mathcal P}_{\theta,x}, \lambda^{*,\mathcal E}_{\theta,x}, \lambda^{*,\mathcal I}_{\theta,x})$ be the saddle point of the Lagrangian corresponding to the state $x \in X$. In many
common coherent risk measures such as CVaR and mean semi-deviation, there are closed-form
formulas for $\xi^*_{\theta,x}$ and the KKT multipliers $(\lambda^{*,\mathcal P}_{\theta,x}, \lambda^{*,\mathcal E}_{\theta,x}, \lambda^{*,\mathcal I}_{\theta,x})$. We will briefly discuss the case when the
saddle point does not have an explicit solution later in this section. Before analyzing the gradient
estimation, we state the following standard assumption, analogous to Assumption 4.1 of the static
case.

Assumption E.3. The likelihood ratio $\nabla_\theta \log \mu_\theta(a|x)$ is well-defined and bounded for all $x \in X$
and $a \in A$.


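As in Theorem 4.2 for the static case, we may use the envelope theorem and the risk-sensitive Bellman
equation, $V_\theta(x) = C_\theta(x) + \gamma \max_{\xi\, :\, \xi P_\theta(\cdot|x) \in \mathcal{U}(x, P_\theta(\cdot|x))} \mathbb{E}_\xi[V_\theta]$, to derive a formula for $\nabla_\theta V_\theta(x)$.
We report this result in Theorem E.4, which is analogous to the risk-neutral policy gradient theorem
[34, 17, 7]. The proof is in the supplementary material.

Theorem E.4. Under Assumption 2.2, we have
\[
\nabla V_\theta(x) = \mathbb{E}_{\xi^*_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \mu_\theta(a_t | x_t)\, h_\theta(x_t, a_t) \,\Big|\, x_0 = x \right],
\]
where $\mathbb{E}_{\xi^*_\theta}[\cdot]$ denotes the expectation w.r.t. trajectories generated by a Markov chain with transition
probabilities $P_\theta(\cdot|x)\, \xi^*_{\theta,x}(\cdot)$, and the stage-wise cost function $h_\theta(x, a)$ is defined as
\[
h_\theta(x, a) = C(x, a) + \sum_{x' \in X} P(x'|x, a)\, \xi^*_{\theta,x}(x') \left[ \gamma V_\theta(x') - \lambda^{*,\mathcal P}_{\theta,x} - \sum_{i \in \mathcal I} \lambda^{*,\mathcal I}_{\theta,x}(i)\, \frac{d f_i(\xi^*_{\theta,x}, p)}{d p(x')} - \sum_{e \in \mathcal E} \lambda^{*,\mathcal E}_{\theta,x}(e)\, \frac{d g_e(\xi^*_{\theta,x}, p)}{d p(x')} \right]. \tag{26}
\]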

Theorem E.4 indicates that the policy gradient of the Markov-coherent dynamic risk measure
$\rho_\infty(\mathcal{M})$, i.e., $\nabla_\theta \rho_\infty(\mathcal{M}) = \nabla_\theta V_\theta$, is equivalent to the risk-neutral value function of policy $\theta$
in an MDP with the stage-wise cost function $\nabla_\theta \log \mu_\theta(a|x)\, h_\theta(x, a)$ (which is well-defined and
bounded), and transition probability $P_\theta(\cdot|x)\, \xi^*_{\theta,x}(\cdot)$. Thus, when the saddle points are known and the
state space $X$ is not too large, we can compute $\nabla_\theta V_\theta$ using a policy evaluation algorithm. However,
when the state space is large, exact calculation of $\nabla V_\theta$ by policy evaluation becomes impossible,
and our goal would be to derive a sampling method to estimate $\nabla V_\theta$. Unfortunately, since the risk
envelope depends on the policy parameter $\theta$, unlike the risk-neutral case, the risk sensitive (or robust)
Bellman equation $T_\theta[V_\theta](x)$ in (22) is nonlinear in the stationary Markov policy $\mu_\theta$. Therefore $h_\theta$
cannot be considered using the action-value function (Q-function) of the robust MDP. Therefore,
even if the exact formulation of the value function $V_\theta$ is known, it is computationally intractable to
enumerate the summation over $x'$ to compute $h_\theta(x, a)$. On top of that, in many applications the value
function $V_\theta$ is not known in advance, which further complicates gradient estimation. To estimate the
policy gradient when the value function is unknown, we approximate it by the projected risk sensitive
value function $\Phi v^*_\theta$. To address the sampling issues, we propose the following two-phase
sampling procedure for estimating $\nabla V_\theta$.

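(1) Generate $N$ trajectories $\{x_0^{(j)}, a_0^{(j)}, x_1^{(j)}, a_1^{(j)}, \ldots\}_{j=1}^N$ from the Markov chain induced by policy
$\theta$ and transition probabilities $P^\xi_\theta(\cdot|x) := \xi^*_{\theta,x}(\cdot)\, P_\theta(\cdot|x)$.

(2) For each state-action pair $(x_t^{(j)}, a_t^{(j)}) = (x, a)$, generate $N$ samples $\{y^{(k)}\}_{k=1}^N$ using the transition
probability $P(\cdot|x, a)$ and calculate the following empirical average estimate of $h_\theta(x, a)$:
\[
h_{\theta;N}(x, a) := C(x, a) + \frac{1}{N} \sum_{k=1}^N \xi^*_{\theta,x}(y^{(k)}) \left[ \gamma\, v^{*\top}_\theta \phi(y^{(k)}) - \lambda^{*,\mathcal P}_{\theta,x} - \sum_{i \in \mathcal I} \lambda^{*,\mathcal I}_{\theta,x}(i)\, \frac{d f_i(\xi^*_{\theta,x}, p)}{d p(y^{(k)})} - \sum_{e \in \mathcal E} \lambda^{*,\mathcal E}_{\theta,x}(e)\, \frac{d g_e(\xi^*_{\theta,x}, p)}{d p(y^{(k)})} \right].
\]

(3) Calculate an estimate of $\nabla V_\theta$ using the following average over all the samples:
\[
\frac{1}{N} \sum_{j=1}^N \sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \mu_\theta(a_t^{(j)} | x_t^{(j)})\, h_{\theta;N}(x_t^{(j)}, a_t^{(j)}).
\]

Indeed, by the definition of the empirical transition probability $P_N(x'|x, a)$, $h_{\theta;N}(x, a)$ can be rewritten
in the same structure as $h_\theta(x, a)$, except that the transition probability $P(x'|x, a)$ is replaced
with $P_N(x'|x, a)$.

Furthermore, in the case that the saddle points $(\xi^*_{\theta,x}, \lambda^{*,\mathcal P}_{\theta,x}, \lambda^{*,\mathcal E}_{\theta,x}, \lambda^{*,\mathcal I}_{\theta,x})$ do not have a closed-form
solution, we may follow the SAA procedure of Section E.3 and replace them and the transition probabilities
$P(x'|x, a)$ with their sample estimates $(\xi^*_{\theta,x;N}, \lambda^{*,\mathcal P}_{\theta,x;N}, \lambda^{*,\mathcal E}_{\theta,x;N}, \lambda^{*,\mathcal I}_{\theta,x;N})$ and $P_N(x'|x, a)$ respectively.

Finally, we show the convergence of the above two-phase sampling procedure. Let $d_{P^\xi_\theta}(x|x_0)$
and $\pi_{P^\xi_\theta}(x, a|x_0)$ be the state and state-action occupancy measures induced by the transition
probability function $P^\xi_\theta(\cdot|x)$, respectively. Similarly, let $d_{P^\xi_{\theta;N}}(x|x_0)$ and $\pi_{P^\xi_{\theta;N}}(x, a|x_0)$ be the
state and state-action occupancy measures induced by the estimated transition probability function
$P^\xi_{\theta;N}(\cdot|x) := \xi^*_{\theta,x;N}(\cdot)\, P_{\theta;N}(\cdot|x)$. From the two-phase sampling procedure for policy gradient estimation
and by the strong law of large numbers, when $N \to \infty$, with probability 1, the discounted empirical
frequency of the event $\{x_t^{(j)} = x,\ a_t^{(j)} = a\}$ along the sampled trajectories equals $\pi_{P^\xi_{\theta;N}}(x, a|x_0)$. Based on the strongly convex
property of the $\ell_2$-regularized objective function in the inner robust optimization problem $\rho_N(\Phi v)$,
we can show that both the state-action occupancy measure $\pi_{P^\xi_{\theta;N}}(x, a|x_0)$ and the stage-wise cost
$h_{\theta;N}(x, a)$ converge to their true values within a value function approximation error bound
$\Delta = \|\Phi v^*_\theta - V_\theta\|_\infty$. We refer the readers to the supplementary materials for these technical results.
These results together with Theorem E.4 imply the consistency of the policy gradient estimation.

Theorem E.5. For any $x_0 \in X$, the following expression holds with probability 1:
\[
\lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^N \sum_{t=0}^{\infty} \gamma^t\, \nabla_\theta \log \mu_\theta(a_t^{(j)} | x_t^{(j)})\, h_{\theta;N}(x_t^{(j)}, a_t^{(j)}) - \nabla V_\theta(x_0) = O(\Delta).
\]

Thm. E.5 guarantees that as the value function approximation error decreases and the number of
samples increases, the sampled gradient converges to the true gradient.

F Convergence Analysis of Empirical PRSVI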


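Lemma F.1 (Technical Lemma). Let $P(\cdot|\cdot)$ and $\tilde{P}(\cdot|\cdot)$ be two arbitrary transition probability matrices.
At state $x \in X$, for any $\xi : \xi P(\cdot|x) \in \mathcal{U}(x, P(\cdot|x))$, there exists a $M_\xi > 0$ such that for
some $\tilde{\xi} : \tilde{\xi} \tilde{P}(\cdot|x) \in \mathcal{U}(x, \tilde{P}(\cdot|x))$,
\[
\sum_{x' \in X} |\xi(x') - \tilde{\xi}(x')| \le M_\xi \sum_{x' \in X} |P(x'|x) - \tilde{P}(x'|x)|.
\]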


Proof From Theorem 2.1 we know that lA(x, P(-|a;)) is a closed, bounded, convex set of proba¬ 


bility distribution functions. Since any conditional probability mass function P is in the interior of 
dom(Z^) and the graph of Z^(a;, P(-|x)) is closed, by Theorem 2.7 in 1^ . U{x,P{-\x)) is a Lip- 
schitz set-valued mapping with respect to the Hausdorff distance. Thus, for any ^ : ^P(-|a;) G 
U{x^ P{-\x)), the following expression holds for some > 0: 


inf 

iGUix,PG\x)) 


- CV)I <M^Y - P{x'\x) 

x'GX x'GX 


Next, we want to show that the infimum of the left side is attained. Since the objective function is
convex, and $\mathcal{U}(x, \tilde{P}(\cdot|x))$ is a convex compact set, there exists $\tilde{\xi} : \tilde{\xi} \tilde{P}(\cdot|x) \in \mathcal{U}(x, \tilde{P}(\cdot|x))$ such
that the infimum is attained. □


Lemma F.2 (Strong Law of Large Numbers). Consider the sampling based PRSVI algorithm with
update sequence $\{v_k\}$. Then as both $N$ and $k$ tend to $\infty$, $v_k$ converges with probability 1 to $v^*_\theta$, the
unique solution of the projected risk sensitive fixed point equation $\Pi T_\theta[\Phi v] = \Phi v$.

Proof. By the strong law of large numbers for Markov processes, the empirical visiting distribution and
transition probability asymptotically converge to their statistical limits with probability 1, i.e.,
\[
\frac{1}{N} \sum_{t=0}^{N-1} \mathbf{1}\{x_t = x\} \to d_\theta(x|x_0), \quad \text{and} \quad P_N(x'|x, a) \to P(x'|x, a), \quad \forall x, x' \in X,\ a \in A.
\]
Therefore with probability 1,
\[
\frac{1}{N} \sum_{t=0}^{N-1} \phi(x_t) \phi(x_t)^\top \to \sum_{x} d_\theta(x|x_0)\, \phi(x) \phi^\top(x),
\qquad
\frac{1}{N} \sum_{t=0}^{N-1} \phi(x_t)\, C_\theta(x_t) \to \sum_{x} d_\theta(x|x_0)\, \phi(x)\, C_\theta(x).
\]
t—0 X 


Now we show that following expression holds with probability 1: 


p. XI ^^^')Ps-,N{x\xt)v^(j){x) + —{^[x')Pe.N[x'\xt)) 


f : £,Pe-,NG\xt)GOi(xt,Pa,N(-\xt)) 

x'£X 


max Fix')Pe{x'\xt)v^ fix'). 


(27) 


^■XPeG\xt)&U(xt,Pe(-\xt)) ^ 

x' £X 


Notice that for € SlgT^^as^s_,^p^,^^^.\^^)^u{x^,Pe.,N{■\xt))'}lx'(iX ^i.x')Pe-N{x'\xt)v^ {x'). 

Lemma |F. 1| implies 


^■^Pe-,N{-\^t)&A{xt,Pe',N{-\xt)) “tX ^-/V 

x'^X 

— max I(x')Pgix'\xt)v^Six') 

C-^Pe(-W)GU{xt,Pe{,-\xt)) 

<|i$t;||oo +max|^;_^^.jv(a:)|'j Y \Pe(.x'\xt) - P0.,N{x'\xt)\ + 

^ ^ x'GX 
The quantity $|\xi^*_{\theta,x_t;N}(x)|$ is bounded because $\mathcal{U}(x_t, P_{\theta;N}(\cdot|x_t))$ is a closed and bounded
convex set from the definition of coherent risk measures. By repeating the above analysis with
$P_\theta$ and $P_{\theta;N}$ interchanged and combining the previous arguments, one obtains




max 

iPe',Ni-\xt)eU{xt,Pe;N{-\x:t)) 


^{x')Pe.N{x'\xt)v^(j){x') + 

x'GX 


1 


{ax')P0Mx'\xt)f 


max 


^ : ^Pe{-\xt)GU{xt,Pei-\xt)) 


^ix')Pg{x'\Xt)v^(l){x') 

x'GX 

<||T>u||oomax|^M5. -f m|x |r(a;)|^ , + m|x j 


y^ \Pg{x'\xt) - Pg-N{x\xt)\ 

x'^X 


Therefore, the claim in expression (27) holds when $N \to \infty$ and
$\sum_{x' \in X} |P_\theta(x'|x_t) - P_{\theta;N}(x'|x_t)| \to 0$. On the other hand, the strong law of large numbers
also implies that with probability 1,


1 

m' 


1 ^/ 

— y]] 0 (a;t)p($uO ^ de(a;|a;o)(/>(a;) max V i{x')Pe{x'\x)v*g^cl){x'). 

N f-" i-.^Pei x)eUix,Pe{- x)) 

t—0 x'G<< 


Combining the above arguments implies 

— V cl>{xt)pN{<^Vt) de{x\xo)(j){x) max V ^{x')Pg{x'\x)vg^(j) {x'). 

-'V ^ (:ePg{-\x}&U{x,Pe{-\x)) ^ 

t—0 x'GX 

As N —> oo, the above arguments imply that Vk — Vk —t 0. On the other hand. Proposition 1 in 
El implies that the projected risk sensitive Bellman operator nT6i[y] is a contraction, it follows 
that from the analysis in Section 6.3 in lO that the sequence {<i>Ufe} generated by projected value 
iteration converges to the unique fixed point This in turns implies that the sequence {$Ufc} 
converges to □ 


G Technical Results 


By convention, $\xi^*_{\theta,x;N}(x') = 0$ whenever $P_{\theta;N}(x'|x) = 0$. In this section, we therefore simplify the
analysis by letting $P_{\theta;N}(x'|x) > 0$ for any $x' \in X$, without loss of generality. Consider the following
empirical robust optimization problem:




max 

ff’8;]v(-|a;)eW(x,Ps.N('|a:)) 


E 


Pg.Nix'\x)^{x')Vg{x'), 


(28) 


where the solution of the above empirical problem is Q,x-,N and the corresponding KKT multipliers 
^E-Af)' Comparing to the optimization problem for i.e.. 


PAr($u) = max ,, ,, E P0-,N{x\x)^{x)(p{x')v+^{^{x)Pg.N{x'\x)f 

x'^Pl 


(29) 

where the solution of the above empirical problem is Q,x-,N and the corresponding KKT multipliers 
('^Eat’ ^s’x-n^ optimization problem in ( |28] l can be viewed as having a skewed 

objective function of the problem in ( |29] l, within the deviation of magnitude A -f 1/2N where 
A = ||$Ug — Velloo- Before getting into the main analysis, we have the following observations. 


(i) Without loss of generality, we can also assume that $(\xi^*_{\theta,x;N}, \lambda^{*,\mathcal P}_{\theta,x;N}, \lambda^{*,\mathcal E}_{\theta,x;N}, \lambda^{*,\mathcal I}_{\theta,x;N})$ follows the
strict complementary slackness condition.*

*The existence of a strictly complementary solution follows from the KKT theorem, and one can
construct a strictly complementary pair in finite time using, e.g., the Balinski-Tucker tableau with the linearized objective
function and constraints.

(ii) Recall from Assumption 2.2 that the functions $f_i(\xi, p)$ and $g_e(\xi, p)$ are twice differentiable in $\xi$
at $p = P_{\theta;N}(\cdot|x)$ for any $x \in X$.

(iii) The Slater's condition in Assumption 2.2 implies the linear independence constraint qualification
(LICQ).

(iv) Since optimization problem (29) has a convex objective function and convex/affine constraints
in $\xi \in \mathbb{R}^{|X|}$, equipped with the Slater's condition we have that the first order KKT condition
holds at $\xi^*_{\theta,x;N}$ with the corresponding KKT multipliers $(\lambda^{*,\mathcal P}_{\theta,x;N}, \lambda^{*,\mathcal E}_{\theta,x;N}, \lambda^{*,\mathcal I}_{\theta,x;N})$.

Furthermore, define the Lagrangian function 

A'^,A^,A^) = P0.n{x'\x)^{x')(I)^{ x')v + ^{Pg;N{x'\x)^{x')f 

-A'^ ( C(a^')^e;iv(a:'|a;)-l| 

\x'ex ) 

-^A^(e)/e(e,P,;^(-|x))-^A^(z)/.(e,Pe;tv(-|a;)). 


One can easily conclude that V^Le;Af(C) A^, A^, A^) = —Pe-N{-\xy^Pe-N{-\x)/N — 
A^( 0 ^|/i (?5 Pe-Ni’\x)) such that for any vector v ^ 0 , 




e,x-,N> ^e’,x:N’ ^d’,x-N^ ^e\x\N)^ < 0 , 


which further implies that the second order sufficient condition (SOSC) holds at 

fc* \*,'P \*,£ \*,I \ 

\^e,x-,N^ '^e,x-,N’ '^ 0 ,x-,N’ ^e,x-,N)- 


Based on all the above analysis, we have the following sensitivity result from Corollary 3.2.4 in lua, 
derived based on Implicit Function Theorem. 

Proposition G.1 (Basic Sensitivity Theorem). Under Assumption 2.2, for any $x \in X$ there exists
a bounded non-singular matrix $K_{\theta,x}$ and a bounded vector $L_{\theta,x}$, such that the difference between
the optimizers and KKT multipliers of optimization problems (28) and (29) is bounded as follows:




Q,x-,N 

^e,x-N 


'^0,x-N 

^e,x\N 


\*,P 

'^0,x;N 

_^e,x\N. 


ff0,x-,N_ 


+ ^e'x'^e^x 



+ o 



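On the other hand, we know from Proposition 4.4 that $(\xi^*_{\theta,x;N},\lambda^{*,P}_{\theta,x;N},\lambda^{*,E}_{\theta,x;N},\lambda^{*,I}_{\theta,x;N}) \to (\xi^*_{\theta,x},\lambda^{*,P}_{\theta,x},\lambda^{*,E}_{\theta,x},\lambda^{*,I}_{\theta,x})$ with probability 1 as $N \to \infty$. Also recall from the law of large numbers that the sampled approximation error $\max_{x\in\mathcal{X},a\in\mathcal{A}}\|P(\cdot|x,a) - P_N(\cdot|x,a)\|_1 \to 0$ almost surely as $N \to \infty$. Then we have the following error bounds for the stage-wise cost approximation $h_{\theta;N}(x,a)$ and the $\gamma$-visiting distribution $\pi_N(x,a)$.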
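The law-of-large-numbers statement above can be illustrated numerically. The following is a minimal sketch, assuming a single hypothetical next-state distribution `p_true` in place of $P(\cdot|x,a)$ and a hypothetical helper `empirical_l1_error`; it only shows the $\ell_1$ error of the empirical estimate shrinking for one fixed seed and is not part of the proof.

```python
import random


def empirical_l1_error(p_true, n_samples, rng):
    """L1 distance between a distribution and its N-sample empirical estimate."""
    counts = [0] * len(p_true)
    for s in rng.choices(range(len(p_true)), weights=p_true, k=n_samples):
        counts[s] += 1
    return sum(abs(c / n_samples - q) for c, q in zip(counts, p_true))


rng = random.Random(0)        # fixed seed so the run is repeatable
p_true = [0.5, 0.3, 0.2]      # hypothetical next-state distribution P(.|x, a)
errors = {n: empirical_l1_error(p_true, n, rng) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"N = {n:>6}: L1 error = {err:.4f}")
```

With a finite state space the maximum over $(x,a)$ is a maximum over finitely many such errors, so it inherits the same almost-sure convergence.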

Lemma G.2. There exists a constant $M_h > 0$ such that $\max_{x\in\mathcal{X},a\in\mathcal{A}}|h_\theta(x,a) - \lim_{N\to\infty}h_{\theta;N}(x,a)| \le M_h\Delta$.


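Proof. First we can easily see that for any state $x \in \mathcal{X}$ and action $a \in \mathcal{A}$,

$$\begin{aligned}|h_{\theta;N}(x,a) - h_\theta(x,a)| \le{}& M\sum_{e\in\mathcal{E}}\big|\lambda^{*,E}_{\theta,x;N}(e) - \lambda^{*,E}_{\theta,x}(e)\big| + M\sum_{i\in\mathcal{I}}\big|\lambda^{*,I}_{\theta,x;N}(i) - \lambda^{*,I}_{\theta,x}(i)\big| \\ &+ \gamma\|V_\theta\|_\infty\|\xi^*_{\theta,x;N} - \xi^*_{\theta,x}\|_1 + \big|\lambda^{*,P}_{\theta,x;N} - \lambda^{*,P}_{\theta,x}\big| + \gamma\|V_\theta - \Phi v^*_\theta\|_\infty \\ &+ \gamma\|V_\theta\|_\infty\max\{\|\xi^*_{\theta,x;N}\|_\infty, \|\xi^*_{\theta,x}\|_\infty\}\,\|P(\cdot|x,a) - P_N(\cdot|x,a)\|_1.\end{aligned}$$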


Note that as $N \to \infty$, $\|P(\cdot|x,a) - P_N(\cdot|x,a)\|_1 \to 0$ with probability 1. Both $\|\xi^*_{\theta,x;N}\|_\infty$ and $\|\xi^*_{\theta,x}\|_\infty$ are finite valued because $U(P_\theta)$ and $U(P_{\theta;N})$ are convex compact sets of real vectors. Therefore, by noting that $\|V_\theta\|_\infty \le C_{\max}/(1-\gamma)$ and applying Propositions 4.4 and G.1, the proof of this lemma is completed by letting $N \to \infty$ and defining


$$M_h(x) = \max\Big\{1,\ M,\ \frac{\gamma C_{\max}}{1-\gamma}\Big\}\,[\ldots]$$

□


Lemma G.3. There exists a constant $M_\pi > 0$ such that $\|\pi - \lim_{N\to\infty}\pi_N\|_1 \le M_\pi\Delta$.


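Proof. First, recall that the $\gamma$-visiting distribution satisfies the following identity:

$$\gamma\sum_{x'\in\mathcal{X}} d_{P^\xi_\theta}(x'|x_0)\,P^\xi_\theta(x|x') = d_{P^\xi_\theta}(x|x_0) - (1-\gamma)\mathbf{1}\{x_0 = x\}, \quad \forall x \in \mathcal{X}.$$

From here one easily notices that this expression can be rewritten in vector form as follows:

$$\big(I - \gamma P^\xi_\theta\big)^\top d_{P^\xi_\theta} = (1-\gamma)\{\mathbf{1}\{x_0 = x\}\}_{x\in\mathcal{X}}.$$

On the other hand, by repeating the analysis with $P_{\theta;N}(\cdot|x)$, we can also write

$$\big(I - \gamma P^\xi_{\theta;N}\big)^\top d_{P^\xi_{\theta;N}} = (1-\gamma)\{\mathbf{1}\{x_0 = x\}\}_{x\in\mathcal{X}}.$$

Combining the above expressions implies, for any $x \in \mathcal{X}$,

$$d_{P^\xi_\theta} - d_{P^\xi_{\theta;N}} - \gamma\Big(\big(P^\xi_\theta\big)^\top d_{P^\xi_\theta} - \big(P^\xi_{\theta;N}\big)^\top d_{P^\xi_{\theta;N}}\Big) = 0, \qquad (30)$$

which further implies

$$\big(I - \gamma P^\xi_\theta\big)^\top\big(d_{P^\xi_\theta} - d_{P^\xi_{\theta;N}}\big) = \gamma\big(P^\xi_\theta - P^\xi_{\theta;N}\big)^\top d_{P^\xi_{\theta;N}}.$$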


Notice that with transition probability matrix $P^\xi_\theta(\cdot|x)$, we have $(I - \gamma P^\xi_\theta)^{-1} = \sum_{k=0}^\infty(\gamma P^\xi_\theta)^k < \infty$. The series is summable because, by the Perron-Frobenius theorem, the maximum eigenvalue of $P^\xi_\theta$ is less than or equal to 1 and $I - \gamma P^\xi_\theta$ is invertible. On the other hand, for every given $x_0 \in \mathcal{X}$,


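$$\begin{aligned}\Big\{\big(P^\xi_\theta - P^\xi_{\theta;N}\big)^\top d_{P^\xi_{\theta;N}}\Big\}(z') &= \sum_{x\in\mathcal{X}}\sum_{k=0}^\infty \gamma^k(1-\gamma)\,\mathbb{P}(x_k = x|x_0)\big(P^\xi_\theta(z'|x) - P^\xi_{\theta;N}(z'|x)\big) \\ &= \mathbb{E}\Big(\sum_{k=0}^\infty \gamma^k(1-\gamma)\big[P^\xi_\theta(z'|x_k) - P^\xi_{\theta;N}(z'|x_k)\big]\,\Big|\,x_0\Big) \\ &\le \mathbb{E}\Big(\sum_{k=0}^\infty \gamma^k(1-\gamma)\big|P^\xi_\theta(z'|x_k) - P^\xi_{\theta;N}(z'|x_k)\big|\,\Big|\,x_0\Big) \\ &=: Q(z'), \quad \forall z' \in \mathcal{X}.\end{aligned}$$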


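Note that every element in the matrix $(I - \gamma P^\xi_\theta)^{-1}$ is non-negative. This implies for any $z \in \mathcal{X}$,

$$\Big|\Big\{\big(I - \gamma (P^\xi_\theta)^\top\big)^{-1}\gamma\big(P^\xi_\theta - P^\xi_{\theta;N}\big)^\top d_{P^\xi_{\theta;N}}\Big\}(z)\Big| \le \Big|\Big\{\big(I - \gamma (P^\xi_\theta)^\top\big)^{-1}\gamma Q\Big\}(z)\Big| = \Big\{\big(I - \gamma (P^\xi_\theta)^\top\big)^{-1}\gamma Q\Big\}(z).$$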
The last equality is due to the fact that every element in vector $Q$ is non-negative. Combining the above results with Propositions 4.4 and G.1 and noting that




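$$\big(I - \gamma P^\xi_\theta\big)^{-1}e = \sum_{k=0}^\infty\big(\gamma P^\xi_\theta\big)^k e = \frac{1}{1-\gamma}\,e,$$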


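we further have that

$$\begin{aligned}\|\pi - \pi_N\|_1 = \big\|d_{P^\xi_\theta} - d_{P^\xi_{\theta;N}}\big\|_1 &\le \frac{\gamma}{1-\gamma}\max_{x\in\mathcal{X}}\big\|P^\xi_\theta(\cdot|x) - P^\xi_{\theta;N}(\cdot|x)\big\|_1 \\ &\le \frac{\gamma}{1-\gamma}\max_{x\in\mathcal{X}}\Big(\|\xi^*_{\theta,x} - \xi^*_{\theta,x;N}\|_1 + \max\{\|\xi^*_{\theta,x;N}\|_\infty, \|\xi^*_{\theta,x}\|_\infty\}\,\|P(\cdot|x,a) - P_N(\cdot|x,a)\|_1\Big).\end{aligned}$$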


As in previous arguments, when $N \to \infty$, one obtains $\|P(\cdot|x,a) - P_N(\cdot|x,a)\|_1 \to 0$ with probability 1 and $\|\xi^*_{\theta,x}(\cdot) - \xi^*_{\theta,x;N}(\cdot)\|_1 \to 0$. We thus set the constant $M_\pi$ as $\gamma\|K^{-1}_{\theta,x}L_{\theta,x}\|_1/(1-\gamma)$. □
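The perturbation argument in this proof can be checked numerically on a small chain. The sketch below is an illustration under stated assumptions, not the paper's procedure: `P` and `P_N` are hypothetical row-stochastic kernels standing in for the exact and sampled transition matrices, and `visiting_distribution` is a hypothetical helper that sums the Neumann series by fixed-point iteration. It checks the visiting-distribution identity and the bound $\|d - d_N\|_1 \le \frac{\gamma}{1-\gamma}\max_x\|P(\cdot|x) - P_N(\cdot|x)\|_1$.

```python
def visiting_distribution(P, x0, gamma, iters=200):
    """Discounted visiting distribution d = (1-gamma) * sum_k gamma^k (P^T)^k e_{x0},
    computed by iterating d <- (1-gamma) e_{x0} + gamma P^T d."""
    n = len(P)
    d = [0.0] * n
    for _ in range(iters):
        d = [(1.0 - gamma) * (1.0 if x == x0 else 0.0)
             + gamma * sum(d[xp] * P[xp][x] for xp in range(n))
             for x in range(n)]
    return d


gamma, x0 = 0.5, 0
# Hypothetical row-stochastic kernels standing in for the exact and sampled chains.
P = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
P_N = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.3, 0.2, 0.5]]

d = visiting_distribution(P, x0, gamma)
d_N = visiting_distribution(P_N, x0, gamma)

# Residual of: gamma * sum_x' d(x') P(x|x') = d(x) - (1-gamma) 1{x0 = x}
residual = max(abs(gamma * sum(d[xp] * P[xp][x] for xp in range(3))
                   - d[x] + (1.0 - gamma) * (x == x0)) for x in range(3))

# L1 gap between the two visiting distributions and the bound from the proof.
gap = sum(abs(a - b) for a, b in zip(d, d_N))
row_err = max(sum(abs(a - b) for a, b in zip(r, r_n)) for r, r_n in zip(P, P_N))
bound = gamma / (1.0 - gamma) * row_err
```

The fixed-point iteration converges geometrically at rate $\gamma$, which is the same fact that makes the series $\sum_k (\gamma P)^k$ summable.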